1 Statistical Basics

You are exposed to statistics regularly. If you are a sports fan, then you have the statistics for your favorite player. If you are interested in politics, then you look at the polls to see how people feel about certain issues or candidates. If you are an environmentalist, then you research arsenic levels in the water of a town or analyze the global temperatures. If you are in the business profession, then you may track the monthly sales of a store or use quality control processes to monitor the number of defective parts manufactured. If you are in the health profession, then you may look at how successful a procedure is or the percentage of people infected with a disease. There are many other examples from other areas. To understand how to collect data and analyze it, you need to understand what the field of statistics is and the basic definitions.

1.1 What is Statistics?

Statistics is the study of how to collect, organize, analyze, and interpret data collected from a group.

There are two branches of statistics. One is called descriptive statistics, which is where you collect and organize data. The other is called inferential statistics, which is where you analyze and interpret data. First you need to look at descriptive statistics since you will use the descriptive statistics when making inferences.

To understand how to create descriptive statistics and then conduct inferences, there are a few definitions that you need to look at. Note, many of the words that are defined have common definitions that are used in non-statistical terminology. In statistics, some have slightly different definitions. It is important that you notice the difference and utilize the statistical definitions.

The first thing to decide in a statistical study is whom you want to measure and what you want to measure. You always want to make sure that you can answer the question of whom you measured and what you measured. The who is known as the observation and the what is the variable(s).

Unit of observation, or simply observation: a person or object that you are interested in finding out information about.

Variable: the measurement or characteristic that is recorded for each observation

Having the observation and the variables is part of the picture of a data set or data frame. To make a data set or data frame into what is called tidy data, it should be organized so that each row of the data frame is an observation and each column is a variable that is well defined and easily identified. An example of a data frame that is tidy data is:

Table 1.1: Example of a Data frame
name	chidren	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
100%_Bran	N	N	C	70	4	1	130	10.0	5.0	6	280	25	3	1	0.33	68.40297
100%_Natural_Bran	N	Q	C	120	3	5	15	2.0	8.0	8	135	0	3	1	1.00	33.98368
All-Bran	N	K	C	70	4	1	260	9.0	7.0	5	320	25	3	1	0.33	59.42551
All-Bran_with_Extra_Fiber	N	K	C	50	4	0	140	14.0	8.0	0	330	25	3	1	0.50	93.70491
Almond_Delight	N	R	C	110	2	2	200	1.0	14.0	8	-1	25	3	1	0.75	34.38484
Apple_Cinnamon_Cheerios	Y	G	C	110	2	2	180	1.5	10.5	10	70	25	1	1	0.75	29.50954
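
As a rough sketch of what tidy data looks like in practice, the first three rows of Table 1.1 can be typed into R as a data frame. Only a few of the columns are used, and the object name `cereal` is an assumption made here for illustration; it does not come from the text.

```r
# Build a small tidy data frame from the first three rows of Table 1.1.
# Only a subset of the columns is used; the name `cereal` is illustrative.
cereal <- data.frame(
  name     = c("100%_Bran", "100%_Natural_Bran", "All-Bran"),
  mfr      = c("N", "Q", "K"),
  calories = c(70, 120, 70),
  fiber    = c(10.0, 2.0, 9.0),
  rating   = c(68.40297, 33.98368, 59.42551)
)

nrow(cereal)   # number of observations (rows)
ncol(cereal)   # number of variables (columns)
str(cereal)    # one line per variable, with its type
```

Each row is one cereal (the observation) and each column is one variable, which is the layout that tidy data calls for.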

Collecting multiple variables from one observation makes sense. If you wanted to figure out the diameter at breast height of Ponderosa Pine trees in the Coconino National Forest, you would need to physically measure a bunch of trees. While you are measuring the diameter, you might also want to measure the height of the tree, whether the tree has a bark beetle infestation, the estimated age of the tree, the color of the bark, and how many branches it has. You may only want to estimate the average diameter at breast height, but now you have the ability to estimate other quantities too. There is no sense in walking all over the forest and only measuring one thing.

A large data frame is one that has at least 5 variables and at least 1000 observations. If a data frame only has 3 variables and 500 rows, that doesn’t make it unusable. The 1000 observations and 5 variables are just a guideline to work with.

If you put the observation and the variable into one statement, then you obtain a population.

Population: set of all values of the variable for the entire group of observations

Notice, the population answers who you want to measure and what you want to measure. Make sure that your population always answers both of these questions. If it doesn’t, then you haven’t given someone who is reading your study the entire picture. As an example, if you just say that you are going to collect data from the senators in the U.S. Congress, you haven’t told your reader what you are going to collect. Do you want to know their income, their highest degree earned, their voting record, their age, their political party, their gender, their marital status, or how they feel about a particular issue? Without telling what you want to measure, your reader has no idea what your study is actually about.

Sometimes the population is very easy to collect. For example, if you are interested in finding the average age of all of the current senators in the U.S. Congress, there are only 100 senators, so this wouldn’t be hard to find. However, if instead you were interested in knowing the average age at which a senator first took office, for all senators who ever served in the U.S. Congress, then this would be a bit more work. It is still doable, but it would take a bit of time to collect. But what if you are interested in finding the average diameter at breast height of all of the Ponderosa Pine trees in the Coconino National Forest? This would be impossible to actually collect. What do you do in these cases? Instead of collecting the entire population, you take a smaller group of the population, kind of a snapshot of the population. This smaller group is called a sample.

Sample: a subset from the population. It looks just like the population, but contains less data.

In today’s world of big data, there is some confusion between really large data frames and populations. The population is a theoretical concept, and even if you have a very large data frame, that doesn’t mean you have the population. Most populations are not actually able to be collected. They are considered an ideal that you are trying to make decisions about.

How you collect your sample can determine how accurate the results of your study are. There are many ways to collect samples, and some create better samples than others, though no sampling method is perfect. Sampling techniques will be discussed later. For now, realize that every time you take a sample you will find different data values. The sample is a snapshot of the population, and there is more information than is in the picture. The idea is to try to collect a sample that gives you an accurate picture, but you will never know for sure if your picture is the correct picture. Unlike previous mathematics classes where there was always one right answer, in statistics there can be many answers, and you don’t know which are right.

Once you have your data frame, either from a population or a sample, you need to know how you want to summarize the data. As an example, suppose you are interested in finding the proportion of people who like a candidate, the average height a plant grows to using a new fertilizer, or the variability of the test scores. Understanding how you want to summarize the data helps to determine the type of data you want to collect. Since the population is what you are interested in, you want to calculate a number from the population. This is known as a parameter. As mentioned already, you usually can’t collect the entire population. Even though the parameter is the number you are interested in, you can’t really calculate it. Instead you use a number calculated from the sample, called a statistic, to estimate the parameter. Since no two samples are exactly the same, the statistic values are going to be different from sample to sample. They estimate the value of the parameter, but again, you do not know for sure if your answer is correct.

Parameter: a number calculated from the population. Usually denoted with a Greek letter. This number is a fixed, unknown number that you want to find.

Statistic: a number calculated from the sample. Usually denoted with letters from the Latin alphabet, though sometimes there is a Greek letter with a $\hat{\ }$ (called a hat) above it. Since you can collect samples, its value is readily known, though it changes depending on the sample taken. It is used to estimate the parameter value.
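
The difference between a fixed parameter and a statistic that changes from sample to sample can be sketched with a short simulation. Everything here is an assumption made for illustration: the population of 10,000 heights is generated by the computer, and the sample size of 25 is arbitrary. With a real population you would rarely be able to compute the parameter at all.

```r
# A made-up population of 10,000 heights (cm); in practice you rarely
# have the whole population, which is why this is only a simulation.
set.seed(1)                                   # makes the run repeatable
population <- rnorm(10000, mean = 50, sd = 8)

parameter <- mean(population)                 # one fixed number for the population

# Draw three different samples of size 25 and compute the statistic for each.
statistics <- replicate(3, mean(sample(population, size = 25)))

parameter    # a single fixed value
statistics   # three values that differ from sample to sample
```

Each of the three sample means is an estimate of the same population mean, yet no two of them agree exactly.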

One last concept to mention is that there are two different types of variables -- qualitative (categorical) and quantitative (numerical). Each type of variable has different parameters and statistics that you find. It is important to know the difference between them.

Qualitative or categorical variable: answer is a word or name that describes a quality of the observation

Quantitative or numerical variable: answer is a number, something that can be counted or measured from the observation
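
As a small sketch of how the two types of variables are treated differently, the code below uses the `mfr` and `calories` columns from the six rows of Table 1.1. Storing `mfr` as a factor is a choice made here for illustration.

```r
# Two columns taken from the six rows of Table 1.1.
mfr      <- factor(c("N", "Q", "K", "K", "R", "G"))   # qualitative (categorical)
calories <- c(70, 120, 70, 50, 110, 110)               # quantitative (numerical)

is.numeric(mfr)        # FALSE: categories are labels, not numbers
is.numeric(calories)   # TRUE

table(mfr)             # counts per category suit a qualitative variable
mean(calories)         # an average suits a quantitative variable
```

Counting how often each category occurs makes sense for the manufacturer, while an average only makes sense for the calories.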

1.1.1 Example: Stating Definitions for Qualitative Variable

In 2010, the Pew Research Center questioned 1500 adults in the U.S. to estimate the proportion of the population favoring marijuana use for medical purposes. It was found that 73% are in favor of using marijuana for medical purposes. State the observation, variable, population, sample, parameter, and statistic.

1.1.1.1 Solution

Observation: a U.S. adult

Variable: the response to the question “should marijuana be used for medical purposes?” This is qualitative data since you are recording a person’s response — yes or no.

Population: set of responses of all adults in the U.S.

Sample: set of responses of 1500 adults in the U.S.

Parameter: proportion of all U.S. adults who favor marijuana for medical purposes

Statistic: proportion of the 1500 U.S. adults surveyed who favor marijuana for medical purposes (73%)

1.1.2 Example: Stating Definitions for Qualitative Variable

A parking control officer records the manufacturer of every $5^{th}$ car in the college parking lot in order to determine the most common manufacturer. State the observation, variable, population, sample, parameter, and statistic.

1.1.2.1 Solution

Observation: a car in the college parking lot

Variable: the name of the manufacturer. This is qualitative data since you are recording a name rather than a number.

Population: set of names of the manufacturer of all cars in the college parking lot

Sample: set of names of the manufacturer of every $5^{th}$ car in the college parking lot

Parameter: proportion of each manufacturer among all cars in the college parking lot

Statistic: proportion of each manufacturer among the sampled cars in the college parking lot

1.1.3 Example: Stating Definitions for Quantitative Variable

A biologist wants to estimate the average height of a plant that is given a new plant food. She gives 10 plants the new plant food and measures the plant height on day 50. State the observation, variable, population, sample, parameter, and statistic.

1.1.3.1 Solution

Observation: a plant given the new plant food

Variable: the height of the plant on day 50 (Note: it is not the average height since you cannot measure an average -- it is calculated from data.) This is quantitative data since you will have a number.

Population: set of heights on day 50 of all plants when the new plant food is used

Sample: set of heights on day 50 of 10 plants when the new plant food is used

Parameter: average height on day 50 of all plants when the new plant food is used

Statistic: average height on day 50 of 10 plants when the new plant food is used

Note: in Example 1.1.3 (Stating Definitions for Quantitative Variable), you most likely will be comparing the new plant food to an old plant food. So you would have more observations, but the plants given the new plant food are what you are interested in in this case. You may also want to have measurements on other days after you give the plant food. In your data frame you would need to have many variables besides just the height of the plant on day 50. Examples of variables would be plant_number, fertilizer (yes or no), height on day 20, height on day 30, height on day 50, and so forth. One other comment: your variable names should make sense to your reader and be one word each, for ease in analyzing by a computer program.
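
A data frame along the lines of this note might be laid out as in the sketch below. The column names `height_day20`, `height_day30`, and `height_day50` are one possible way to turn the variables into single words, and every measurement is a made-up placeholder, not data from the biologist's study.

```r
# Hypothetical layout only: every measurement below is a made-up placeholder.
plants <- data.frame(
  plant_number = 1:4,
  fertilizer   = c("yes", "yes", "no", "no"),
  height_day20 = c(5.1, 4.8, 4.0, 4.2),
  height_day30 = c(9.3, 8.9, 7.1, 7.4),
  height_day50 = c(16.2, 15.7, 12.0, 12.6)
)

# Each row is one plant (observation); each column is one variable.
names(plants)

# Statistic for the plants given the new plant food only.
mean(plants$height_day50[plants$fertilizer == "yes"])
```

Because each variable name is a single word with no spaces, the columns can be referred to directly in code, as in the last line.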

1.1.4 Example: Stating Definitions for Quantitative Variable

A doctor wants to see if a new treatment for cancer extends the life expectancy of a patient versus the old treatment. She gives one group of 25 cancer patients the new treatment and another group of 25 the old treatment. She then measures the life expectancy of each of the patients. State the observations, variables, populations, samples, parameters, and statistics.

1.1.4.1 Solution

In this example there are two observations, two variables, two populations, and two samples.

Observation 1: cancer patient given new treatment

Observation 2: cancer patient given old treatment

Variable 1: life expectancy when given new treatment. This is quantitative data since you will have a number.

Variable 2: life expectancy when given old treatment. This is quantitative data since you will have a number.

Population 1: set of life expectancies of all cancer patients given new treatment

Population 2: set of life expectancies of all cancer patients given old treatment

Sample 1: set of life expectancies of 25 cancer patients given new treatment

Sample 2: set of life expectancies of 25 cancer patients given old treatment

Parameter 1: average life expectancy of all cancer patients given new treatment

Parameter 2: average life expectancy of all cancer patients given old treatment

Statistic 1: average life expectancy of 25 cancer patients given new treatment

Statistic 2: average life expectancy of 25 cancer patients given old treatment

There are two types of quantitative variables, discrete and continuous. The difference is in how many values the data can have. If you can actually count the number of data values (even if you are counting to infinity), then the variable is called discrete. If it is not possible to count the number of data values, then the variable is called continuous.

Discrete data can only take on particular values like integers. Discrete data are usually things you count.

Continuous data can take on any value within a range. Continuous data are usually things you measure.

1.1.5 Example: Discrete or Continuous

Classify the quantitative variable as discrete or continuous.

The weight of a cat.
The number of fleas on a cat.
The size of a shoe.

1.1.5.1 Solution

The weight of a cat.
This is continuous since it is something you measure.

The number of fleas on a cat.
This is discrete since it is something you count.

The size of a shoe.
This is discrete since it can only take certain values, such as 7, 7.5, 8, 8.5, 9. You can’t buy a size 9.73 shoe.

There are also four measurement scales for different types of data, with each building on the ones below it. They are:

1.1.6 Measurement Scales:

Nominal: data is just a name or category. There is no order to any data and since there are no numbers, you cannot do any arithmetic on this level of data. Examples of this are gender, car name, ethnicity, and race.

Ordinal: data that is nominal, but you can now put the data in order, since one value is more or less than another value. You still cannot do arithmetic on this data. Examples of this are grades (A, B, C, D, F), place in a race (1st, 2nd, 3rd), and size of a drink (small, medium, large).

Interval: data that is ordinal, but you can now subtract one value from another and that subtraction makes sense. You can do arithmetic on this data, but only addition and subtraction. Examples of this are temperature and time on a clock.

Ratio: data that is interval, but you can now divide one value by another and that ratio makes sense. You can now do all arithmetic on this data. Examples of this are height, weight, distance, and length of time.

Nominal and ordinal data come from qualitative variables. Interval and ratio data come from quantitative variables.

Most people have a hard time deciding if the data are nominal, ordinal, interval, or ratio. First, if the variable is qualitative (words instead of numbers) then it is either nominal or ordinal. Now ask yourself if you can put the data in a particular order. If you can it is ordinal. Otherwise, it is nominal. If the variable is quantitative (numbers), then it is either interval or ratio. For ratio data, a value of 0 means there is no measurement. This is known as the absolute zero. If there is an absolute zero in the data, then it means it is ratio. If there is no absolute zero, then the data are interval. An example of an absolute zero is if you have \$0 in your bank account, then you are without money. The amount of money in your bank account is ratio data. Word of caution: sometimes ordinal data is displayed using numbers, such as 5 being strongly agree, and 1 being strongly disagree. These numbers are not really numbers. Instead they are used to assign numerical values to ordinal data. In reality you should not perform any computations on this data, though many people do. If there are numbers, make sure the numbers are inherent numbers, and not numbers that were assigned.
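
The sketch below illustrates three of the scales in R. The drink sizes come from the ordinal examples above. The temperatures of 10 and 20 degrees Celsius are arbitrary values, and the conversion to Kelvin is added here only to show a temperature scale that does have an absolute zero.

```r
# Ordinal: categories with an order, stored as an ordered factor.
drink <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
drink[1] < drink[2]      # TRUE: "small" comes before "large"
sort(drink)              # small, medium, large

# Interval: differences in Celsius are meaningful, ratios are not.
celsius <- c(10, 20)
diff(celsius)                    # 10 degrees warmer
celsius[2] / celsius[1]          # 2, but 20 C is not "twice as hot" as 10 C

# Ratio: Kelvin has an absolute zero, so the ratio is meaningful.
kelvin <- celsius + 273.15
kelvin[2] / kelvin[1]            # about 1.035
```

The same two temperatures give a ratio of 2 in Celsius and about 1.035 in Kelvin, which is why a ratio only makes sense when 0 means that nothing is there.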

1.1.7 Example: Measurement Scale

State which measurement scale each is.

Time of first class
Hair color
Length of time to take a test
Age groupings (baby, toddler, adolescent, teenager, adult, elderly)

1.1.7.1 Solution

Time of first class
This is interval since it is a number, but 0 o’clock means midnight and not the absence of time.

Hair color
This is nominal since it is not a number, and there is no specific order for hair color.

Length of time to take a test
This is ratio since it is a number, and if you take 0 minutes to take a test, it means you didn’t take any time to complete it.

Age groupings (baby, toddler, adolescent, teenager, adult, elderly)
This is ordinal since it is not a number, but you could put the data in order from youngest to oldest or the other way around.

1.1.8 Homework for What is Statistics Section

Suppose you want to know how Arizona workers age 16 or older travel to work. To estimate the percentage of people who use the different modes of travel, you take a sample containing 500 Arizona workers age 16 or older. State the observation, variable, population, sample, parameter, and statistic.
You wish to estimate the mean cholesterol levels of patients two days after they had a heart attack. To estimate the mean you collect data from 28 heart patients. State the observation, variable, population, sample, parameter, and statistic.
Print-O-Matic would like to estimate the mean salary of all its employees. To accomplish this they collect the salary of 19 employees. State the observation, variable, population, sample, parameter, and statistic.
To estimate the percentage of households in Connecticut which use fuel oil as a heating source, a researcher collects information from 1000 Connecticut households about what fuel is their heating source. State the observation, variable, population, sample, parameter, and statistic.
The U.S. Census Bureau needs to estimate the median income of males in the U.S. They collect incomes from 2500 males. State the observation, variable, population, sample, parameter, and statistic.
The U.S. Census Bureau needs to estimate the median income of females in the U.S. They collect incomes from 3500 females. State the observation, variable, population, sample, parameter, and statistic.
Eyeglassmatic manufactures eyeglasses and they would like to know the percentage of each defect type made. They review 25,891 defects and classify each defect that is made. State the observation, variable, population, sample, parameter, and statistic.
The World Health Organization wishes to estimate the mean density of people per square kilometer. They collect data on 56 countries. State the observation, variable, population, sample, parameter, and statistic.
State the measurement scale for each.

Cholesterol level
Defect type
Time of first class
Opinion on a 5 point scale, with 5 being strongly agree and 1 being strongly disagree

State the measurement scale for each.

Temperature in degrees Celsius
Ice cream flavors available
Pain levels on a scale from 1 to 10, 10 being the worst pain ever
Salary of employees

1.2 Sampling Methods

As stated before, if you want to know something about a population, it is often impossible or impractical to examine the whole population. It might be too expensive in terms of time or money. It might be impractical — you can’t test all batteries for their length of lifetime because there wouldn’t be any batteries left to sell. You need to look at a sample. Hopefully the sample behaves the same as the population.

When you choose a sample you want it to be as similar to the population as possible. If you want to test a new painkiller for adults you would want the sample to include people who are fat, skinny, old, young, healthy, not healthy, male, female, etc.

There are many ways to collect a sample. None are perfect, and you are not guaranteed to collect a representative sample. Those are, unfortunately, the limitations of sampling. However, there are several techniques that can result in samples that give you a reasonably accurate picture of the population. Just remember that the sample may not be representative. As an example, you can take a random sample from a group of people that is half male and half female, yet by chance everyone you choose is female. If this happens, it may be a good idea to collect a new sample if you have the time and money. There are many sampling techniques, though only four will be presented here.

The simplest, and the type that is most desired, is a simple random sample. This is where you pick the sample such that every possible sample of a given size has the same chance of being chosen. This type of sample is actually hard to collect, since it is sometimes difficult to obtain a complete list of all observations. There are many cases where you cannot conduct a truly random sample. However, you can get as close as you can.

Now suppose you are interested in what type of music people like. It might not make sense to try to find the most popular type of music preferred by everyone in the U.S. You probably don’t like the same music as your parents. The answers vary so much you probably couldn’t find an answer for everyone all at once. It might make sense to look at people in different age groups, or people of different ethnicities. This is called a stratified sample. The issue with this sample type is that sometimes people subdivide the population too much. It is best to just have one stratification. Also, a stratified sample has problems similar to those of a simple random sample.

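If your population has some order in it, then you could do a systematic sample, where you select every $k^{th}$ observation, such as every $5^{th}$ part coming off an assembly line. This is popular in manufacturing. The problem is that it is possible to miss a manufacturing mistake because of how this sample is taken. For example, a defect that recurs at a regular interval can fall between the items you select.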
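
A systematic sample is usually taken by choosing a random starting point and then selecting every $k^{th}$ item after it. A minimal sketch in base R follows. The lot of 100 parts and the interval of 5 are hypothetical numbers chosen for illustration.

```r
# Systematic sample: pick a random starting point, then take every k-th item.
# The lot of 100 parts and the interval k = 5 are hypothetical.
set.seed(2)
n_items <- 100
k       <- 5

start    <- sample(1:k, size = 1)            # random start between 1 and k
selected <- seq(from = start, to = n_items, by = k)

selected           # positions of the items to inspect
length(selected)   # 20 items, one from each block of 5
```

Notice that any mistake that only shows up in positions between the selected ones would never be inspected.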

If you are collecting polling data based on location, then a cluster sample, which divides the population into groups based on geography, would be the easiest sample to conduct. The problem is that if you are looking for people’s opinions, people who live in the same region may have similar opinions. As you can see, each of the sampling techniques has pluses and minuses.

One last type of sample that is sometimes conducted is called a convenience sample. This type of sample should be avoided, since it is collected using whatever process is most convenient for the researcher. The researcher may ask people they know or who are easy to get a hold of, and the result is in no way representative of the population.
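
In a cluster sample, you typically choose a few of the groups (clusters) at random and then collect data from everyone in the chosen groups. The sketch below shows the idea in base R. The 12 people and the four region names are hypothetical.

```r
# Hypothetical population: 12 people living in 4 regions (the clusters).
people <- data.frame(
  person = 1:12,
  region = rep(c("north", "south", "east", "west"), each = 3)
)

set.seed(3)
# Randomly choose 2 of the 4 regions, then keep everyone in those regions.
chosen_regions <- sample(unique(people$region), size = 2)
cluster_sample <- people[people$region %in% chosen_regions, ]

chosen_regions
cluster_sample   # 6 rows: all 3 people from each chosen region
```

Only two regions are represented in the sample, so if opinions differ by region, the other two regions are not heard from at all.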

A simple random sample (SRS) of size n is a sample that is selected from a population in a way that ensures that every different possible sample of size n has the same chance of being selected. Also, every observation associated with the population has the same chance of being selected.

Ways to select a simple random sample:

Put all names in a hat and draw a certain number of names out.
Assign each observation a number and use a random number table or a calculator or computer to randomly select the observations that will be measured.

1.2.1 Example: Choosing a Simple Random Sample

Describe how to take a simple random sample from a classroom.

1.2.1.1 Solution

Give each student in the class a number. Then use a random number generator to pick as many numbers as the number of students you want in your sample.

1.2.2 Example: How Not to Choose a Simple Random Sample

You want to choose 5 students out of a class of 20. Give some examples of samples that are *not* simple random samples.

1.2.2.1 Solution

Choose 5 students from the front row. The people in the last row have no chance of being selected.
Choose the 5 shortest students. The tallest students have no chance of being selected.
Ask your friend to pick numbers that have been assigned to each student. Your friend may prefer certain numbers and pick those, usually without realizing it.
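
By contrast, one way to take a proper simple random sample of 5 students out of 20 is sketched below in base R. The seed value is arbitrary and is only there so that the same sample comes up each time the code is run.

```r
# Number the 20 students, then let R choose 5 of them at random.
set.seed(4)
students <- 1:20
chosen   <- sample(students, size = 5)   # sampling without replacement

chosen          # the 5 student numbers selected
choose(20, 5)   # 15504 possible samples, each equally likely under an SRS
```

Every one of the 15504 possible groups of 5 students has the same chance of being chosen, which is exactly what the front-row and shortest-student methods fail to provide.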

1.2.3 Example: How to Choose a Simple Random Sample using R

To take a simple random sample of size 10 from the data frame NHANES (the result is shown in Table 1.2), use these steps:

library("NHANES") # loads the NHANES package, which contains the data frame
library("dplyr") # loads dplyr, which provides slice_sample()
sample_NHANES <- # gives the new sample a name
  NHANES |> # states the data frame to sample from
  slice_sample(n = 10) # takes a random sample of 10 rows, saved as sample_NHANES
options(width = 60)
knitr::kable(sample_NHANES) # displays the sample just created

Table 1.2: Random Sample of size 10 from NHANES
ID	SurveyYr	Gender	Age	AgeDecade	AgeMonths	Race1	Race3	Education	MaritalStatus	HHIncome	HHIncomeMid	Poverty	HomeRooms	HomeOwn	Work	Weight	Length	HeadCirc	Height	BMI	BMICatUnder20yrs	BMI_WHO	Pulse	BPSysAve	BPDiaAve	BPSys1	BPDia1	BPSys2	BPDia2	BPSys3	BPDia3	Testosterone	DirectChol	TotChol	UrineVol1	UrineFlow1	UrineVol2	UrineFlow2	Diabetes	DiabetesAge	HealthGen	DaysPhysHlthBad	DaysMentHlthBad	LittleInterest	Depressed	nPregnancies	nBabies	Age1stBaby	SleepHrsNight	SleepTrouble	PhysActive	PhysActiveDays	TVHrsDay	CompHrsDay	TVHrsDayChild	CompHrsDayChild	Alcohol12PlusYr	AlcoholDay	AlcoholYear	SmokeNow	Smoke100	Smoke100n	SmokeAge	Marijuana	AgeFirstMarij	RegularMarij	AgeRegMarij	HardDrugs	SexEver	SexAge	SexNumPartnLife	SexNumPartYear	SameSex	SexOrientation	PregnantNow
65494	2011_12	male	3	0-9	NA	White	White	NA	NA	more 99999	100000	5.00	7	Own	NA	16.6	98.2	NA	95.0	18.40	Obese	12.0_18.5	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1_hr	0_hrs	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
59327	2009_10	female	44	40-49	529	White	NA	College Grad	Married	75000-99999	87500	5.00	8	Own	Working	51.0	NA	NA	165.4	18.64	NA	18.5_to_24.9	62	110	70	NA	NA	112	68	108	72	NA	2.46	6.05	61	0.452	NA	NA	No	NA	Vgood	0	30	None	Several	1	NA	NA	6	Yes	Yes	3	NA	NA	NA	NA	Yes	3	260	No	Yes	Smoker	NA	Yes	18	No	NA	Yes	Yes	20	30	0	No	Heterosexual	No
67274	2011_12	female	46	40-49	NA	Other	Asian	8th Grade	Married	75000-99999	87500	2.68	6	Own	Working	52.6	NA	NA	156.3	21.50	NA	18.5_to_24.9	82	119	63	120	60	118	64	120	62	13.81	1.47	7.32	78	0.897	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	8	No	No	1	1_hr	0_hrs	NA	NA	NA	NA	NA	NA	No	Non-Smoker	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
61155	2009_10	female	63	60-69	757	White	NA	High School	Married	55000-64999	60000	4.12	4	Own	Working	95.1	NA	NA	159.0	37.62	NA	30.0_plus	80	116	69	114	78	122	72	110	66	NA	1.32	5.35	202	NA	NA	NA	No	NA	Good	0	2	Several	None	4	4	18	8	Yes	No	NA	NA	NA	NA	NA	Yes	NA	0	No	Yes	Smoker	18	NA	NA	NA	NA	No	Yes	17	3	NA	No	NA	NA
59802	2009_10	male	44	40-49	538	White	NA	Some College	Married	75000-99999	87500	5.00	7	Own	Working	107.3	NA	NA	187.8	30.42	NA	30.0_plus	92	128	81	122	82	128	84	128	78	NA	0.96	6.00	269	1.681	NA	NA	No	NA	Good	28	10	Several	Several	NA	NA	NA	6	No	Yes	2	NA	NA	NA	NA	Yes	1	1	No	Yes	Smoker	16	Yes	15	Yes	17	Yes	Yes	18	4	1	No	Heterosexual	NA
55878	2009_10	female	20	20-29	241	White	NA	Some College	NeverMarried	55000-64999	60000	3.28	7	Own	Working	75.5	NA	NA	170.3	26.03	NA	25.0_to_29.9	60	110	61	114	56	110	60	110	62	NA	1.66	4.53	293	2.873	NA	NA	No	NA	Vgood	0	0	None	None	NA	NA	NA	7	No	Yes	2	NA	NA	NA	NA	No	NA	NA	NA	No	Non-Smoker	NA	No	NA	No	NA	No	No	NA	0	0	No	Heterosexual	No
52122	2009_10	male	67	60-69	812	White	NA	Some College	Married	more 99999	100000	5.00	10	Own	Working	104.0	NA	NA	179.3	32.35	NA	30.0_plus	58	140	76	144	74	134	74	146	78	NA	1.97	4.78	117	0.807	NA	NA	No	NA	Vgood	0	0	None	None	NA	NA	NA	8	No	Yes	5	NA	NA	NA	NA	Yes	2	364	No	Yes	Smoker	19	NA	NA	NA	NA	No	Yes	17	5	NA	No	NA	NA
59842	2009_10	male	23	20-29	283	Mexican	NA	8th Grade	Married	25000-34999	30000	1.37	5	Rent	Working	NA	NA	NA	NA	NA	NA	NA	58	116	64	114	64	118	62	114	66	NA	NA	NA	NA	NA	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	7	No	No	NA	NA	NA	NA	NA	NA	NA	NA	No	Yes	Smoker	10	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
65927	2011_12	female	65	60-69	NA	White	White	Some College	Married	45000-54999	50000	3.06	7	Own	NotWorking	99.1	NA	NA	172.4	33.30	NA	30.0_plus	72	124	74	124	74	126	72	122	76	8.04	1.66	5.43	19	0.328	50	0.391	No	NA	Vgood	0	0	Several	None	3	2	25	7	Yes	No	4	4_hr	0_hrs	NA	NA	Yes	2	260	NA	No	Non-Smoker	NA	NA	NA	NA	NA	No	Yes	20	3	NA	No	NA	NA
56096	2009_10	male	40	40-49	490	White	NA	Some College	Married	65000-74999	70000	3.17	6	Own	Working	131.7	NA	NA	195.9	34.32	NA	30.0_plus	70	132	89	134	88	130	88	134	90	NA	1.11	5.38	107	1.698	NA	NA	No	NA	Fair	1	0	None	None	NA	NA	NA	6	No	Yes	5	NA	NA	NA	NA	Yes	3	12	NA	No	Non-Smoker	NA	Yes	15	No	NA	Yes	Yes	18	4	1	No	Heterosexual	NA
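
Because the sample is random, rerunning the code above gives a different table each time. One common way to make a sample repeatable is to set a seed first. The sketch below shows the idea in base R, using the built-in `mtcars` data frame as a stand-in so that it runs without the NHANES package. The seed value 2024 is arbitrary.

```r
# Base R stand-in for the idea above, using the built-in mtcars data frame.
set.seed(2024)                                   # arbitrary seed
rows_a <- sample(nrow(mtcars), size = 10)        # row numbers of the sample

set.seed(2024)                                   # same seed again
rows_b <- sample(nrow(mtcars), size = 10)

identical(rows_a, rows_b)    # TRUE: same seed, same sample
sample_mtcars <- mtcars[rows_a, ]
nrow(sample_mtcars)          # 10
```

Setting a seed does not make the sample any less random. It only lets you, or a reader, reproduce the same sample later.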

Stratified sampling is where you break the population into groups called strata, then take a simple random sample from each stratum.

For example:

If you want to look at musical preference, you could divide the observations into age groups and then conduct simple random samples inside each group.
If you want to calculate the average price of textbooks, you could divide the observations into groups by major and then conduct simple random samples inside each group.
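
The textbook example can be sketched in base R as follows. The 30 students, the three major names, and the choice of 2 students per major are all hypothetical values used only to show the mechanics.

```r
# Hypothetical population: 30 students in 3 majors (the strata).
students <- data.frame(
  id    = 1:30,
  major = rep(c("biology", "history", "math"), each = 10)
)

set.seed(5)
# Split row numbers by stratum, then take an SRS of 2 rows inside each one.
rows_by_major <- split(seq_len(nrow(students)), students$major)
picked_rows   <- unlist(lapply(rows_by_major, function(r) sample(r, size = 2)))
stratified    <- students[picked_rows, ]

table(stratified$major)   # 2 students from each major
```

Unlike a simple random sample of 6 students, this approach guarantees that every major appears in the sample.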

1.2.4 Example: How to Choose a Stratified Sample using R

To take a stratified sample from NHANES in RStudio, using race as the strata and drawing 20 observations from each stratum (100 in total, shown in Table 1.3), use these steps:

library("NHANES") # loads the NHANES package, which contains the data frame
library("dplyr") # loads dplyr, which provides group_by() and slice_sample()
sample_NHANES <- # gives the new sample a name
  NHANES |> # states the data frame to sample from
  group_by(Race1) |> # tells which variable defines the strata
  slice_sample(n = 20) # takes a random sample of 20 rows within each stratum
options(width = 60)
knitr::kable(sample_NHANES) # displays the sample just created

Table 1.3: Stratified Sample of size 100 from NHANES with Race as the Strata
ID	SurveyYr	Gender	Age	AgeDecade	AgeMonths	Race1	Race3	Education	MaritalStatus	HHIncome	HHIncomeMid	Poverty	HomeRooms	HomeOwn	Work	Weight	Length	HeadCirc	Height	BMI	BMICatUnder20yrs	BMI_WHO	Pulse	BPSysAve	BPDiaAve	BPSys1	BPDia1	BPSys2	BPDia2	BPSys3	BPDia3	Testosterone	DirectChol	TotChol	UrineVol1	UrineFlow1	UrineVol2	UrineFlow2	Diabetes	DiabetesAge	HealthGen	DaysPhysHlthBad	DaysMentHlthBad	LittleInterest	Depressed	nPregnancies	nBabies	Age1stBaby	SleepHrsNight	SleepTrouble	PhysActive	PhysActiveDays	TVHrsDay	CompHrsDay	TVHrsDayChild	CompHrsDayChild	Alcohol12PlusYr	AlcoholDay	AlcoholYear	SmokeNow	Smoke100	Smoke100n	SmokeAge	Marijuana	AgeFirstMarij	RegularMarij	AgeRegMarij	HardDrugs	SexEver	SexAge	SexNumPartnLife	SexNumPartYear	SameSex	SexOrientation	PregnantNow
67292	2011_12	male	8	0-9	NA	Black	Black	NA	NA	25000-34999	30000	0.99	7	Own	NA	26.5	NA	NA	135.6	14.40	NormWeight	12.0_18.5	76	100	0	104	40	100	0	100	0	NA	NA	NA	42	0.222	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3_hr	1_hr	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
62969	2011_12	male	14	10-19	NA	Black	Black	NA	NA	65000-74999	70000	2.56	13	Own	NA	85.7	NA	NA	183.7	25.40	OverWeight	25.0_to_29.9	60	110	54	114	62	110	56	110	52	264.55	1.11	2.17	106	0.507	NA	NA	No	NA	Excellent	0	0	NA	NA	NA	NA	NA	NA	NA	Yes	6	0_to_1_hr	2_hr	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
54689	2009_10	female	25	20-29	301	Black	NA	College Grad	NeverMarried	25000-34999	30000	1.34	5	Rent	Working	147.2	NA	NA	167.7	52.34	NA	30.0_plus	86	133	72	130	78	132	70	134	74	NA	0.88	4.60	86	0.723	NA	NA	No	NA	Good	7	2	None	None	NA	NA	NA	9	No	No	NA	NA	NA	NA	NA	No	1	5	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	19	2	0	No	Heterosexual	No
55109	2009_10	female	4	0-9	54	Black	NA	NA	NA	25000-34999	30000	0.97	4	Rent	NA	23.5	NA	NA	113.5	18.24	NA	12.0_18.5	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	5	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
63877	2011_12	female	20	20-29	NA	Black	Black	Some College	NeverMarried	NA	NA	0.32	4	Rent	Working	44.0	NA	NA	157.5	17.70	NA	12.0_18.5	66	99	39	106	48	100	42	98	36	35.32	1.86	4.55	26	0.055	39	0.368	No	NA	Good	0	0	None	None	1	NA	NA	9	No	No	2	4_hr	4_hr	NA	NA	Yes	3	30	NA	No	Non-Smoker	NA	Yes	17	No	NA	No	Yes	17	5	2	No	Heterosexual	No
69654	2011_12	male	20	20-29	NA	Black	Black	Some College	NeverMarried	75000-99999	87500	4.26	6	Rent	Looking	116.7	NA	NA	184.2	34.40	NA	30.0_plus	70	132	76	126	72	134	74	130	78	474.10	1.27	4.42	101	0.561	NA	NA	No	NA	Vgood	0	0	None	None	NA	NA	NA	7	No	Yes	3	4_hr	0_to_1_hr	NA	NA	Yes	1	10	Yes	Yes	Smoker	16	Yes	16	Yes	16	No	Yes	14	4	1	No	Heterosexual	NA
54563	2009_10	male	73	70+	886	Black	NA	Some College	Divorced	5000-9999	7500	0.51	7	Own	NotWorking	75.2	NA	NA	174.2	24.78	NA	18.5_to_24.9	88	112	59	114	62	108	60	116	58	NA	NA	NA	114	1.129	NA	NA	No	NA	Good	0	0	None	None	NA	NA	NA	6	Yes	No	NA	NA	NA	NA	NA	Yes	2	364	No	Yes	Smoker	13	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
62755	2011_12	female	19	10-19	NA	Black	Black	NA	NA	10000-14999	12500	0.66	5	Rent	Working	55.2	NA	NA	162.8	20.80	NormWeight	18.5_to_24.9	66	98	67	94	60	100	66	96	68	11.05	1.37	4.47	102	0.338	NA	NA	No	NA	Fair	0	0	Most	Most	NA	NA	NA	4	No	No	5	3_hr	0_hrs	NA	NA	No	NA	NA	NA	NA	NA	NA	No	NA	No	NA	No	Yes	16	5	2	No	Heterosexual	NA
52426	2009_10	male	21	20-29	253	Black	NA	High School	NeverMarried	NA	NA	NA	8	Own	Working	61.8	NA	NA	177.4	19.64	NA	18.5_to_24.9	60	98	44	98	46	96	52	100	36	NA	1.03	4.63	28	1.556	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	5	No	No	NA	NA	NA	NA	NA	NA	NA	NA	Yes	Yes	Smoker	14	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
54460	2009_10	male	30	30-39	371	Black	NA	High School	LivePartner	25000-34999	30000	0.72	4	Rent	NotWorking	126.0	NA	NA	186.7	36.15	NA	30.0_plus	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	0.88	5.79	183	2.128	NA	NA	Yes	19	NA	NA	NA	NA	NA	NA	NA	NA	8	Yes	No	NA	NA	NA	NA	NA	NA	NA	NA	Yes	Yes	Smoker	7	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
61332	2009_10	male	40	40-49	485	Black	NA	High School	NeverMarried	35000-44999	40000	1.59	5	Own	Working	117.0	NA	NA	171.3	39.87	NA	30.0_plus	94	123	82	118	84	120	84	126	80	NA	1.45	6.67	176	1.067	NA	NA	No	NA	Fair	0	0	Several	Most	NA	NA	NA	7	No	Yes	1	NA	NA	NA	NA	Yes	10	12	NA	No	Non-Smoker	NA	Yes	15	Yes	17	No	Yes	14	10	10	No	Heterosexual	NA
66283	2011_12	female	33	30-39	NA	Black	Black	Some College	Married	75000-99999	87500	3.36	6	Other	Working	113.7	NA	NA	168.2	40.20	NA	30.0_plus	70	105	81	110	80	106	82	104	80	57.50	1.27	6.31	48	0.658	180	1.241	No	NA	Good	0	0	None	None	2	2	23	6	No	No	5	2_hr	2_hr	NA	NA	Yes	2	60	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	16	1	1	No	Heterosexual	No
63545	2011_12	female	56	50-59	NA	Black	Black	College Grad	Married	more 99999	100000	5.00	10	Own	Working	63.8	NA	NA	159.4	25.10	NA	25.0_to_29.9	72	112	75	118	76	112	78	112	72	15.50	2.20	5.22	119	1.352	NA	NA	No	NA	Good	3	0	None	None	2	1	NA	8	No	Yes	NA	2_hr	0_to_1_hr	NA	NA	No	1	3	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	17	2	0	No	Heterosexual	NA
55483	2009_10	female	56	50-59	680	Black	NA	Some College	Married	55000-64999	60000	4.26	6	Own	Working	90.1	NA	NA	165.2	33.01	NA	30.0_plus	88	137	90	136	94	140	94	134	86	NA	1.47	5.64	20	0.123	73	3.174	No	NA	Fair	10	0	None	None	3	1	NA	5	No	Yes	2	NA	NA	NA	NA	No	NA	NA	NA	No	Non-Smoker	NA	Yes	19	No	NA	No	Yes	17	4	1	No	Heterosexual	NA
67014	2011_12	female	48	40-49	NA	Black	Black	High School	NeverMarried	20000-24999	22500	0.95	5	Own	NotWorking	83.2	NA	NA	160.1	32.50	NA	30.0_plus	66	98	66	102	68	98	66	98	66	10.02	1.50	3.44	22	NA	NA	NA	No	NA	Excellent	0	0	None	None	1	1	NA	6	No	No	NA	More_4_hr	0_to_1_hr	NA	NA	No	NA	NA	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	18	4	0	No	Heterosexual	NA
51905	2009_10	female	52	50-59	634	Black	NA	9 - 11th Grade	Married	25000-34999	30000	0.97	8	Own	NotWorking	72.0	NA	NA	157.7	28.95	NA	25.0_to_29.9	74	112	78	122	80	112	80	112	76	NA	1.29	3.70	79	0.230	NA	NA	No	NA	Fair	15	15	Most	Most	7	6	17	6	Yes	No	NA	NA	NA	NA	NA	No	NA	0	No	Yes	Smoker	15	Yes	18	No	NA	No	Yes	15	10	1	No	Heterosexual	NA
54900	2009_10	male	46	40-49	560	Black	NA	9 - 11th Grade	Married	25000-34999	30000	1.16	2	Rent	Looking	146.4	NA	NA	172.2	49.37	NA	30.0_plus	94	185	99	188	96	188	100	182	98	NA	0.96	4.42	120	1.348	NA	NA	Yes	NA	NA	NA	NA	NA	NA	NA	NA	NA	3	Yes	No	NA	NA	NA	NA	NA	NA	NA	NA	Yes	Yes	Smoker	25	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
57435	2009_10	female	56	50-59	677	Black	NA	Some College	Separated	75000-99999	87500	5.00	5	Rent	Working	68.1	NA	NA	163.0	25.63	NA	25.0_to_29.9	76	143	89	148	92	146	92	140	86	NA	1.40	6.15	244	2.324	NA	NA	No	NA	Good	1	0	None	None	8	3	24	6	No	Yes	4	NA	NA	NA	NA	Yes	1	3	No	Yes	Smoker	24	Yes	23	Yes	23	Yes	Yes	17	8	0	No	Heterosexual	NA
54478	2009_10	male	43	40-49	527	Black	NA	High School	NeverMarried	45000-54999	50000	2.27	7	Own	Working	134.7	NA	NA	176.4	43.29	NA	30.0_plus	56	139	80	144	80	138	78	140	82	NA	1.37	4.73	57	NA	NA	NA	No	NA	Vgood	0	5	None	None	NA	NA	NA	5	No	Yes	3	NA	NA	NA	NA	Yes	3	2	No	Yes	Smoker	18	Yes	14	Yes	16	Yes	Yes	19	5	0	No	Heterosexual	NA
66011	2011_12	female	12	10-19	NA	Black	Black	NA	NA	65000-74999	70000	2.07	6	Own	NA	51.3	NA	NA	163.6	19.20	NormWeight	18.5_to_24.9	64	108	62	106	60	108	60	108	64	31.14	1.47	3.70	105	0.636	NA	NA	No	NA	Vgood	0	0	NA	NA	NA	NA	NA	NA	NA	Yes	NA	0_to_1_hr	0_to_1_hr	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
67299	2011_12	female	7	0-9	NA	Hispanic	Hispanic	NA	NA	5000-9999	7500	0.42	4	Rent	NA	19.5	NA	NA	114.7	14.80	NormWeight	12.0_18.5	NA	NA	NA	NA	NA	NA	NA	NA	NA	3.32	1.24	3.49	50	0.417	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	7	3_hr	0_hrs	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
65315	2011_12	male	7	0-9	NA	Hispanic	Hispanic	NA	NA	75000-99999	87500	3.58	8	Own	NA	22.6	NA	NA	118.7	16.00	NormWeight	12.0_18.5	NA	NA	NA	NA	NA	NA	NA	NA	NA	1.36	1.50	4.14	46	0.329	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3	2_hr	1_hr	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
63122	2011_12	female	3	0-9	NA	Hispanic	Hispanic	NA	NA	25000-34999	30000	0.93	6	Rent	NA	18.9	101.7	NA	100.0	18.90	Obese	18.5_to_24.9	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	7	3_hr	0_hrs	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
66162	2011_12	female	60	60-69	NA	Hispanic	Hispanic	College Grad	Married	more 99999	100000	5.00	5	Rent	Working	77.5	NA	NA	160.0	30.30	NA	30.0_plus	62	138	83	148	84	142	86	134	80	25.83	1.32	6.36	104	1.106	NA	NA	Yes	52	Fair	4	3	Several	None	3	3	26	6	No	No	NA	0_to_1_hr	0_hrs	NA	NA	No	1	2	NA	No	Non-Smoker	NA	NA	NA	NA	NA	No	Yes	24	1	NA	No	NA	NA
57246	2009_10	male	53	50-59	646	Hispanic	NA	9 - 11th Grade	Married	55000-64999	60000	2.49	4	Rent	Working	89.0	NA	NA	176.3	28.63	NA	25.0_to_29.9	66	106	71	112	76	108	72	104	70	NA	NA	NA	124	1.319	NA	NA	Yes	52	NA	NA	NA	NA	NA	NA	NA	NA	8	No	No	NA	NA	NA	NA	NA	NA	NA	NA	No	Yes	Smoker	18	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
71359	2011_12	male	41	40-49	NA	Hispanic	Hispanic	8th Grade	Married	0-4999	2500	0.01	4	Rent	Looking	77.1	NA	NA	166.4	27.80	NA	25.0_to_29.9	66	113	67	110	66	114	72	112	62	322.46	0.91	4.19	74	0.291	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	7	No	No	5	More_4_hr	2_hr	NA	NA	NA	NA	NA	NA	No	Non-Smoker	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
63534	2011_12	male	39	30-39	NA	Hispanic	Hispanic	High School	Married	45000-54999	50000	1.85	4	Own	Working	86.2	NA	NA	177.8	27.30	NA	25.0_to_29.9	64	116	79	118	72	116	80	116	78	303.88	1.55	4.11	78	0.582	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	7	No	Yes	NA	2_hr	1_hr	NA	NA	NA	NA	NA	NA	No	Non-Smoker	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
56447	2009_10	male	17	10-19	207	Hispanic	NA	NA	NA	75000-99999	87500	3.30	7	Own	Working	85.4	NA	NA	180.6	26.18	NA	25.0_to_29.9	66	111	18	114	40	112	36	110	0	NA	1.03	5.15	54	0.831	NA	NA	No	NA	Good	0	0	NA	NA	NA	NA	NA	7	No	Yes	4	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
66788	2011_12	male	31	30-39	NA	Hispanic	Hispanic	Some College	LivePartner	NA	NA	NA	8	Own	Working	64.9	NA	NA	167.4	23.20	NA	18.5_to_24.9	70	114	67	116	72	112	70	116	64	262.00	1.45	4.45	137	0.419	NA	NA	No	NA	Vgood	5	0	None	None	NA	NA	NA	7	No	Yes	NA	More_4_hr	More_4_hr	NA	NA	Yes	12	156	Yes	Yes	Smoker	15	Yes	13	Yes	13	Yes	Yes	13	50	2	No	Heterosexual	NA
58604	2009_10	female	13	10-19	166	Hispanic	NA	NA	NA	65000-74999	70000	2.68	7	Own	NA	56.5	NA	NA	162.1	21.50	NA	18.5_to_24.9	90	100	49	102	54	98	52	102	46	NA	1.53	4.78	26	0.213	NA	NA	No	NA	Good	0	0	NA	NA	NA	NA	NA	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
52927	2009_10	female	35	30-39	422	Hispanic	NA	Some College	Married	75000-99999	87500	5.00	7	Own	Working	85.6	NA	NA	171.4	29.14	NA	25.0_to_29.9	72	121	80	114	74	118	80	124	80	NA	1.58	4.73	127	NA	NA	NA	No	NA	Good	0	5	None	None	NA	NA	NA	10	No	No	NA	NA	NA	NA	NA	Yes	2	104	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	16	30	3	No	Heterosexual	Unknown
66833	2011_12	female	39	30-39	NA	Hispanic	Hispanic	High School	Married	10000-14999	12500	0.73	4	Rent	Working	56.4	NA	NA	157.5	22.70	NA	18.5_to_24.9	64	125	84	122	86	122	86	128	82	44.05	1.66	3.96	43	0.439	NA	NA	No	NA	Good	0	3	Several	Several	2	2	17	8	No	No	1	3_hr	1_hr	NA	NA	Yes	2	52	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	16	5	2	No	Heterosexual	No
66587	2011_12	male	47	40-49	NA	Hispanic	Hispanic	Some College	Married	45000-54999	50000	2.17	8	Own	Working	79.2	NA	NA	178.7	24.80	NA	18.5_to_24.9	70	120	70	128	78	122	72	118	68	369.51	1.60	6.49	140	2.373	NA	NA	No	NA	Fair	20	10	Most	Most	NA	NA	NA	7	No	No	7	0_to_1_hr	2_hr	NA	NA	Yes	1	12	NA	No	Non-Smoker	NA	Yes	17	Yes	17	No	Yes	17	15	1	No	Heterosexual	NA
53883	2009_10	female	35	30-39	422	Hispanic	NA	High School	Married	45000-54999	50000	2.27	7	Own	Working	66.0	NA	NA	162.8	24.90	NA	18.5_to_24.9	80	109	61	106	62	108	60	110	62	NA	2.12	4.45	48	0.578	NA	NA	No	NA	Vgood	0	5	None	None	6	2	31	6	No	Yes	3	NA	NA	NA	NA	Yes	2	156	No	Yes	Smoker	20	No	NA	No	NA	No	Yes	15	10	4	Yes	Heterosexual	No
57282	2009_10	female	39	30-39	468	Hispanic	NA	8th Grade	Married	35000-44999	40000	1.08	4	Rent	Working	54.1	NA	NA	155.7	22.32	NA	18.5_to_24.9	58	108	60	114	58	108	60	108	60	NA	0.98	6.00	60	0.311	NA	NA	No	NA	Fair	30	15	Most	None	4	4	16	8	No	No	NA	NA	NA	NA	NA	Yes	4	3	Yes	Yes	Smoker	17	No	NA	No	NA	No	Yes	16	1	1	No	Heterosexual	No
66486	2011_12	male	1	0-9	23	Hispanic	Hispanic	NA	NA	15000-19999	17500	0.76	6	Rent	NA	12.4	86.6	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	7	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
70420	2011_12	female	55	50-59	NA	Hispanic	Hispanic	8th Grade	Widowed	15000-19999	17500	0.43	4	Rent	Looking	100.0	NA	NA	159.4	39.40	NA	30.0_plus	76	122	63	120	66	126	62	118	64	14.73	1.06	4.22	16	0.123	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	8	No	No	NA	2_hr	0_hrs	NA	NA	NA	NA	NA	No	Yes	Smoker	38	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
54816	2009_10	female	32	30-39	392	Hispanic	NA	8th Grade	NeverMarried	NA	NA	NA	5	Rent	Working	51.1	NA	NA	147.6	23.46	NA	18.5_to_24.9	58	99	60	98	64	98	60	100	60	NA	1.50	4.81	42	0.276	NA	NA	No	NA	Fair	2	3	None	Several	3	2	20	6	No	No	NA	NA	NA	NA	NA	No	NA	NA	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	20	NA	1	No	Heterosexual	No
57160	2009_10	male	49	40-49	591	Hispanic	NA	Some College	Married	35000-44999	40000	1.95	5	Own	Working	92.7	NA	NA	173.1	30.94	NA	30.0_plus	82	125	84	118	84	124	82	126	86	NA	1.45	7.16	42	0.724	128	1.196	No	NA	Good	0	0	None	None	NA	NA	NA	9	No	No	NA	NA	NA	NA	NA	Yes	6	364	No	Yes	Smoker	14	Yes	30	No	NA	No	Yes	16	60	1	No	Heterosexual	NA
70792	2011_12	male	30	30-39	NA	Hispanic	Hispanic	8th Grade	LivePartner	NA	NA	0.52	3	Rent	Working	67.2	NA	NA	160.8	26.00	NA	25.0_to_29.9	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	300	1.364	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	8	No	Yes	NA	0_to_1_hr	0_hrs	NA	NA	NA	NA	NA	No	Yes	Smoker	18	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
71658	2011_12	male	25	20-29	NA	Mexican	Mexican	9 - 11th Grade	LivePartner	5000-9999	7500	0.13	4	Rent	Working	90.5	NA	NA	168.3	32.00	NA	30.0_plus	58	124	77	118	74	126	76	122	78	416.63	0.80	6.18	155	2.214	NA	NA	No	NA	Good	0	0	None	None	NA	NA	NA	7	No	Yes	NA	2_hr	0_hrs	NA	NA	Yes	3	104	Yes	Yes	Smoker	13	Yes	16	No	NA	No	Yes	16	6	1	No	Heterosexual	NA
68906	2011_12	female	16	10-19	NA	Mexican	Mexican	NA	NA	25000-34999	30000	1.15	3	Rent	NotWorking	74.3	NA	NA	156.0	30.50	Obese	30.0_plus	78	107	12	110	42	106	24	108	0	17.43	1.45	5.40	32	0.176	58	0.395	No	NA	Fair	0	0	NA	NA	NA	NA	NA	6	No	No	4	3_hr	3_hr	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
55576	2009_10	female	37	30-39	453	Mexican	NA	Some College	Married	75000-99999	87500	2.71	6	Own	NotWorking	53.1	NA	NA	154.4	22.27	NA	18.5_to_24.9	86	102	71	102	68	104	72	100	70	NA	0.67	4.22	109	0.474	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	6	No	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	Non-Smoker	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No
56866	2009_10	male	46	40-49	563	Mexican	NA	9 - 11th Grade	Married	20000-24999	22500	0.93	5	Rent	Working	93.8	NA	NA	171.3	31.97	NA	30.0_plus	86	129	91	128	94	128	90	130	92	NA	1.19	5.61	73	0.664	NA	NA	Yes	40	Fair	0	0	Several	Several	NA	NA	NA	8	No	No	NA	NA	NA	NA	NA	Yes	NA	0	NA	No	Non-Smoker	NA	No	NA	No	NA	Yes	Yes	20	20	0	No	Heterosexual	NA
58981	2009_10	female	44	40-49	535	Mexican	NA	8th Grade	Separated	5000-9999	7500	0.31	3	Rent	Working	81.1	NA	NA	153.8	34.29	NA	30.0_plus	68	109	63	112	66	110	64	108	62	NA	0.75	4.34	114	1.869	NA	NA	No	NA	Good	0	0	None	None	2	2	17	7	No	Yes	5	NA	NA	NA	NA	Yes	NA	0	NA	No	Non-Smoker	NA	No	NA	No	NA	No	No	NA	0	0	No	Heterosexual	No
61744	2009_10	female	18	10-19	226	Mexican	NA	NA	NA	NA	NA	0.44	3	Rent	NotWorking	66.5	NA	NA	154.7	27.79	NA	25.0_to_29.9	70	104	52	108	52	104	56	104	48	NA	1.50	3.70	44	0.071	30	0.291	No	NA	Fair	7	10	None	None	NA	NA	NA	8	No	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	NA	No	NA	No	Yes	15	5	4	No	Heterosexual	NA
59300	2009_10	male	22	20-29	271	Mexican	NA	8th Grade	NeverMarried	35000-44999	40000	2.40	2	Rent	Working	72.3	NA	NA	169.3	25.22	NA	25.0_to_29.9	72	123	63	132	66	124	62	122	64	NA	1.42	4.22	133	1.090	NA	NA	No	NA	Vgood	0	0	None	None	NA	NA	NA	9	No	No	NA	NA	NA	NA	NA	No	NA	1	Yes	Yes	Smoker	15	No	NA	No	NA	No	Yes	17	11	6	No	Heterosexual	NA
70769	2011_12	female	21	20-29	NA	Mexican	Mexican	9 - 11th Grade	NeverMarried	20000-24999	22500	0.86	4	Rent	NotWorking	61.4	NA	NA	152.0	26.60	NA	25.0_to_29.9	70	89	56	82	54	86	54	92	58	32.99	1.53	4.47	45	0.111	NA	NA	No	NA	Fair	0	0	None	None	2	2	17	8	No	No	NA	2_hr	2_hr	NA	NA	Yes	3	2	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	15	5	1	No	Heterosexual	No
58333	2009_10	male	14	10-19	174	Mexican	NA	NA	NA	more 99999	100000	5.00	7	Own	NA	89.4	NA	NA	171.1	30.54	NA	30.0_plus	98	108	63	108	68	104	62	112	64	NA	1.09	3.85	132	NA	NA	NA	No	NA	Vgood	2	0	NA	NA	NA	NA	NA	NA	NA	Yes	5	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
60053	2009_10	male	20	20-29	248	Mexican	NA	9 - 11th Grade	LivePartner	more 99999	100000	3.76	9	Own	Looking	90.7	NA	NA	182.6	27.20	NA	25.0_to_29.9	78	108	55	114	62	108	54	108	56	NA	0.75	4.78	227	0.652	NA	NA	No	NA	Vgood	8	2	None	None	NA	NA	NA	7	No	No	NA	NA	NA	NA	NA	Yes	5	4	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	16	5	1	No	Heterosexual	NA
59359	2009_10	male	10	10-19	123	Mexican	NA	NA	NA	20000-24999	22500	0.95	4	Rent	NA	38.8	NA	NA	145.0	18.45	NA	12.0_18.5	80	92	57	90	60	92	56	92	58	NA	NA	NA	200	1.087	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2	6	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
61834	2009_10	female	46	40-49	557	Mexican	NA	College Grad	NeverMarried	45000-54999	50000	1.86	4	Own	Working	65.8	NA	NA	160.6	25.51	NA	25.0_to_29.9	58	104	56	NA	NA	106	54	102	58	NA	1.66	4.91	116	0.959	NA	NA	No	NA	Fair	2	7	Several	Several	3	1	NA	7	Yes	Yes	3	NA	NA	NA	NA	Yes	3	24	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	19	10	1	No	Heterosexual	NA
65265	2011_12	male	56	50-59	NA	Mexican	Mexican	9 - 11th Grade	LivePartner	10000-14999	12500	0.50	4	Rent	Looking	87.8	NA	NA	175.1	28.60	NA	25.0_to_29.9	60	108	66	106	48	108	62	108	70	411.37	1.24	4.60	64	0.512	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	7	No	No	NA	2_hr	0_hrs	NA	NA	NA	NA	NA	NA	No	Non-Smoker	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
61595	2009_10	male	35	30-39	423	Mexican	NA	8th Grade	Married	25000-34999	30000	1.24	5	Own	Working	80.3	NA	NA	169.7	27.88	NA	25.0_to_29.9	66	110	65	114	64	108	70	112	60	NA	NA	NA	125	1.543	NA	NA	No	NA	Vgood	0	0	None	None	NA	NA	NA	7	No	No	NA	NA	NA	NA	NA	Yes	1	104	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	15	10	1	No	Heterosexual	NA
65823	2011_12	male	37	30-39	NA	Mexican	Mexican	9 - 11th Grade	Separated	0-4999	2500	0.33	4	Own	Working	87.7	NA	NA	172.4	29.50	NA	25.0_to_29.9	66	132	88	132	92	132	86	132	90	608.95	1.11	4.53	144	0.246	NA	NA	No	NA	Fair	0	0	None	None	NA	NA	NA	4	No	Yes	3	3_hr	0_hrs	NA	NA	Yes	12	52	Yes	Yes	Smoker	12	Yes	12	Yes	12	Yes	Yes	15	60	2	No	Heterosexual	NA
64181	2011_12	female	56	50-59	NA	Mexican	Mexican	High School	LivePartner	75000-99999	87500	5.00	7	Own	NotWorking	98.3	NA	NA	164.5	36.30	NA	30.0_plus	68	104	70	106	70	106	72	102	68	10.02	1.50	5.72	126	2.100	NA	NA	No	NA	Vgood	0	7	None	None	2	2	16	5	Yes	No	5	2_hr	2_hr	NA	NA	Yes	2	3	No	Yes	Smoker	14	Yes	19	Yes	19	Yes	Yes	15	5	1	No	Heterosexual	NA
61103	2009_10	male	42	40-49	509	Mexican	NA	Some College	Married	55000-64999	60000	3.34	7	Own	Working	65.2	NA	NA	167.7	23.18	NA	18.5_to_24.9	78	121	61	126	58	122	62	120	60	NA	2.20	6.15	99	1.707	NA	NA	No	NA	Good	0	0	None	None	NA	NA	NA	8	No	Yes	4	NA	NA	NA	NA	Yes	1	260	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	18	6	1	No	Heterosexual	NA
56867	2009_10	male	30	30-39	364	Mexican	NA	High School	Married	more 99999	100000	4.51	5	Own	Working	79.1	NA	NA	170.0	27.37	NA	25.0_to_29.9	78	111	75	116	66	110	74	112	76	NA	0.91	4.76	87	0.978	NA	NA	No	NA	Good	14	0	None	None	NA	NA	NA	6	No	No	NA	NA	NA	NA	NA	Yes	3	36	NA	No	Non-Smoker	NA	No	NA	No	NA	Yes	Yes	15	4	1	No	Heterosexual	NA
60497	2009_10	female	5	0-9	68	Mexican	NA	NA	NA	NA	NA	NA	9	Own	NA	21.7	NA	NA	111.1	17.58	NA	12.0_18.5	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
69363	2011_12	female	22	20-29	NA	Mexican	Mexican	High School	NeverMarried	10000-14999	12500	0.54	4	Rent	Working	92.4	NA	NA	159.2	36.50	NA	30.0_plus	70	103	77	102	72	106	76	100	78	21.90	1.14	4.40	104	0.972	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	8	No	Yes	2	3_hr	0_hrs	NA	NA	NA	NA	NA	Yes	Yes	Smoker	12	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No
53440	2009_10	female	8	0-9	105	White	NA	NA	NA	55000-64999	60000	2.40	9	Own	NA	30.5	NA	NA	132.7	17.32	NA	12.0_18.5	100	89	52	94	44	90	56	88	48	NA	1.91	4.01	59	0.881	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	0	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
61085	2009_10	female	51	50-59	614	White	NA	Some College	Married	more 99999	100000	5.00	8	Own	NotWorking	70.4	NA	NA	168.1	24.91	NA	25.0_to_29.9	64	124	75	130	72	124	76	124	74	NA	1.91	5.07	57	1.213	NA	NA	No	NA	Good	30	0	None	None	2	2	22	8	No	No	NA	NA	NA	NA	NA	Yes	1	156	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	21	2	0	No	Heterosexual	NA
62418	2011_12	male	80	NA	NA	White	White	8th Grade	Widowed	25000-34999	30000	1.98	4	Own	NotWorking	70.1	NA	NA	172.6	23.50	NA	18.5_to_24.9	62	164	68	156	64	162	72	166	64	683.12	0.96	4.45	77	0.316	NA	NA	No	NA	Fair	0	0	None	None	NA	NA	NA	7	No	No	2	More_4_hr	0_hrs	NA	NA	Yes	NA	0	Yes	Yes	Smoker	13	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
69919	2011_12	male	31	30-39	NA	White	White	Some College	NeverMarried	NA	NA	NA	3	Rent	Working	58.4	NA	NA	163.9	21.70	NA	18.5_to_24.9	66	116	72	122	72	118	72	114	72	727.14	1.14	4.71	27	0.284	343	3.206	No	NA	Vgood	0	1	None	None	NA	NA	NA	8	No	Yes	NA	2_hr	2_hr	NA	NA	Yes	4	104	No	Yes	Smoker	14	Yes	13	Yes	15	Yes	Yes	13	60	5	Yes	Heterosexual	NA
69488	2011_12	female	65	60-69	NA	White	White	High School	Divorced	20000-24999	22500	1.30	6	Rent	Working	85.3	NA	NA	171.2	29.10	NA	25.0_to_29.9	54	102	60	108	62	100	60	104	60	6.94	1.29	4.58	18	0.220	26	0.208	No	NA	Vgood	0	3	Most	Several	5	4	17	8	No	No	3	More_4_hr	More_4_hr	NA	NA	No	NA	NA	NA	No	Non-Smoker	NA	NA	NA	NA	NA	No	Yes	16	5	NA	No	NA	NA
66506	2011_12	female	25	20-29	NA	White	White	College Grad	Married	15000-19999	17500	1.16	2	Rent	Working	50.0	NA	NA	169.0	17.50	NA	12.0_18.5	58	98	62	98	58	98	64	98	60	NA	NA	NA	181	0.973	NA	NA	No	NA	Vgood	0	5	None	Several	NA	NA	NA	9	No	Yes	7	2_hr	2_hr	NA	NA	No	NA	NA	NA	No	Non-Smoker	NA	Yes	20	No	NA	No	Yes	20	1	1	No	Heterosexual	No
55849	2009_10	male	57	50-59	686	White	NA	High School	Divorced	25000-34999	30000	3.05	6	Own	Working	105.4	NA	NA	165.3	38.57	NA	30.0_plus	78	120	70	134	78	122	70	118	70	NA	1.01	5.20	157	4.361	NA	NA	Yes	43	Good	0	0	None	None	NA	NA	NA	6	No	Yes	7	NA	NA	NA	NA	Yes	6	104	No	Yes	Smoker	18	NA	NA	NA	NA	No	Yes	16	50	12	No	Heterosexual	NA
68423	2011_12	female	42	40-49	NA	White	White	College Grad	Married	more 99999	100000	5.00	9	Own	Working	61.5	NA	NA	166.1	22.30	NA	18.5_to_24.9	72	123	65	110	70	124	66	122	64	28.63	2.30	5.20	123	0.837	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	7	No	No	NA	2_hr	1_hr	NA	NA	NA	NA	NA	NA	No	Non-Smoker	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No
57753	2009_10	male	74	70+	890	White	NA	9 - 11th Grade	Married	75000-99999	87500	5.00	5	Own	NotWorking	63.6	NA	NA	166.9	22.83	NA	18.5_to_24.9	72	129	65	130	60	128	66	130	64	NA	0.98	4.22	142	0.394	NA	NA	No	NA	Vgood	15	0	Most	Several	NA	NA	NA	10	No	No	NA	NA	NA	NA	NA	Yes	1	4	No	Yes	Smoker	16	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
67001	2011_12	male	76	70+	NA	White	White	9 - 11th Grade	Married	45000-54999	50000	3.40	8	Own	NotWorking	81.1	NA	NA	173.9	26.80	NA	25.0_to_29.9	68	116	63	116	62	110	62	122	64	650.05	1.58	6.34	95	0.601	NA	NA	No	NA	Good	3	0	None	None	NA	NA	NA	7	Yes	No	NA	3_hr	0_to_1_hr	NA	NA	No	NA	NA	No	Yes	Smoker	14	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
70980	2011_12	male	32	30-39	NA	White	White	High School	Married	45000-54999	50000	1.28	13	Own	Working	87.7	NA	NA	178.9	27.40	NA	25.0_to_29.9	88	122	68	122	74	124	66	120	70	418.63	1.42	5.04	280	0.438	NA	NA	No	NA	Vgood	0	0	None	None	NA	NA	NA	7	No	No	NA	2_hr	0_to_1_hr	NA	NA	Yes	1	2	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	19	4	1	No	Heterosexual	NA
67988	2011_12	male	14	10-19	NA	White	White	NA	NA	more 99999	100000	5.00	6	Own	NA	66.3	NA	NA	173.0	22.20	NormWeight	18.5_to_24.9	88	128	54	122	58	124	50	132	58	71.35	1.22	3.93	260	1.126	NA	NA	No	NA	Good	5	0	NA	NA	NA	NA	NA	NA	NA	Yes	NA	2_hr	1_hr	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
56323	2009_10	female	53	50-59	636	White	NA	High School	Married	55000-64999	60000	3.28	10	Own	Working	75.5	NA	NA	163.0	28.42	NA	25.0_to_29.9	76	122	80	120	86	118	86	126	74	NA	1.50	4.99	97	1.276	NA	NA	No	NA	Good	0	0	None	None	2	2	23	6	Yes	No	NA	NA	NA	NA	NA	No	NA	NA	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	19	1	1	No	Heterosexual	NA
68294	2011_12	female	11	10-19	NA	White	White	NA	NA	75000-99999	87500	3.30	6	Own	NA	78.8	NA	NA	161.3	30.30	Obese	30.0_plus	88	98	71	NA	NA	100	72	96	70	11.39	1.01	3.52	18	0.171	29	0.397	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	4_hr	1_hr	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
63418	2011_12	female	40	40-49	NA	White	White	9 - 11th Grade	NeverMarried	35000-44999	40000	2.20	6	Rent	NotWorking	NA	NA	NA	NA	NA	NA	NA	122	115	48	116	54	114	44	116	52	30.53	0.83	3.39	64	0.557	NA	NA	Yes	30	Good	15	0	Several	Several	NA	NA	NA	5	Yes	Yes	NA	More_4_hr	0_to_1_hr	NA	NA	Yes	3	10	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	39	1	1	No	Heterosexual	No
63046	2011_12	male	1	0-9	NA	White	White	NA	NA	25000-34999	30000	1.73	7	Own	NA	11.9	86.1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
56376	2009_10	male	0	0-9	10	White	NA	NA	NA	more 99999	100000	4.54	7	Own	NA	9.7	76.8	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
57197	2009_10	male	18	10-19	223	White	NA	NA	NA	75000-99999	87500	3.49	12	Own	Working	75.2	NA	NA	183.0	22.46	NA	18.5_to_24.9	92	112	56	110	54	114	60	110	52	NA	0.98	4.11	111	0.816	NA	NA	No	NA	Excellent	0	1	None	None	NA	NA	NA	10	No	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	NA	No	NA	No	No	NA	0	0	No	Heterosexual	NA
53351	2009_10	female	27	20-29	325	White	NA	College Grad	LivePartner	more 99999	100000	5.00	4	Own	Working	83.8	NA	NA	180.6	25.69	NA	25.0_to_29.9	60	110	62	108	64	110	64	110	60	NA	1.89	6.13	366	0.915	NA	NA	No	NA	Vgood	1	0	None	None	NA	NA	NA	6	No	Yes	6	NA	NA	NA	NA	Yes	1	208	No	Yes	Smoker	18	Yes	17	No	NA	No	Yes	18	20	1	No	Heterosexual	No
65688	2011_12	female	54	50-59	NA	White	White	Some College	Married	more 99999	100000	5.00	8	Own	Working	64.8	NA	NA	160.9	25.00	NA	25.0_to_29.9	64	122	69	124	66	118	66	126	72	8.89	1.68	6.47	64	0.566	NA	NA	No	NA	Vgood	0	0	None	None	2	2	31	8	No	Yes	4	1_hr	1_hr	NA	NA	Yes	3	120	No	Yes	Smoker	18	Yes	16	Yes	18	Yes	Yes	16	6	0	No	Heterosexual	NA
70092	2011_12	male	24	20-29	NA	Other	Other	Some College	NeverMarried	65000-74999	70000	3.51	3	Rent	Looking	85.5	NA	NA	184.1	25.20	NA	25.0_to_29.9	96	143	59	NA	NA	140	58	146	60	886.96	1.19	4.11	279	1.446	NA	NA	No	NA	Good	0	0	None	None	NA	NA	NA	7	Yes	Yes	6	4_hr	1_hr	NA	NA	Yes	5	208	No	Yes	Smoker	15	Yes	12	Yes	14	Yes	Yes	12	20	3	No	Heterosexual	NA
57059	2009_10	male	10	10-19	125	Other	NA	NA	NA	10000-14999	12500	0.54	6	Rent	NA	23.4	NA	NA	127.6	14.37	NA	12.0_18.5	86	108	47	114	48	108	50	108	44	NA	1.42	3.93	18	0.207	32	0.727	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	5	5	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
51945	2009_10	female	46	40-49	558	Other	NA	9 - 11th Grade	Widowed	15000-19999	17500	0.09	3	Rent	NotWorking	78.1	NA	NA	150.0	34.71	NA	30.0_plus	90	121	74	130	78	124	76	118	72	NA	1.27	7.19	56	0.514	NA	NA	Yes	41	Poor	13	7	Most	None	4	3	16	6	Yes	No	NA	NA	NA	NA	NA	Yes	2	12	Yes	Yes	Smoker	16	Yes	18	Yes	19	Yes	Yes	15	1	1	No	Heterosexual	NA
52618	2009_10	female	6	0-9	73	Other	NA	NA	NA	35000-44999	40000	1.26	4	Rent	NA	26.9	NA	NA	122.3	17.98	NA	12.0_18.5	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1.37	3.90	107	0.241	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	4	6	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
63425	2011_12	male	51	50-59	NA	Other	Asian	Some College	Married	75000-99999	87500	4.03	6	Own	Working	67.3	NA	NA	175.1	22.00	NA	18.5_to_24.9	76	119	79	126	82	122	78	116	80	326.28	0.91	4.50	201	4.102	NA	NA	No	NA	Good	0	0	Several	None	NA	NA	NA	4	Yes	No	NA	1_hr	1_hr	NA	NA	No	NA	0	No	Yes	Smoker	18	No	NA	No	NA	Yes	Yes	23	5	1	No	Heterosexual	NA
51711	2009_10	female	59	50-59	718	Other	NA	8th Grade	Widowed	20000-24999	22500	1.37	4	Rent	NotWorking	54.3	NA	NA	145.1	25.79	NA	25.0_to_29.9	84	150	0	144	0	150	0	150	0	NA	1.06	4.16	42	0.389	NA	NA	Yes	51	NA	NA	NA	NA	NA	NA	NA	NA	5	Yes	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	Non-Smoker	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
68117	2011_12	male	28	20-29	NA	Other	Other	High School	Married	75000-99999	87500	3.25	10	Own	Working	92.1	NA	NA	170.9	31.50	NA	30.0_plus	78	121	62	120	70	120	64	122	60	NA	NA	NA	72	0.649	NA	NA	No	NA	Good	0	5	Several	Several	NA	NA	NA	6	No	Yes	NA	3_hr	1_hr	NA	NA	Yes	6	312	Yes	Yes	Smoker	16	Yes	18	Yes	18	No	Yes	16	5	1	No	Heterosexual	NA
53141	2009_10	male	14	10-19	169	Other	NA	NA	NA	10000-14999	12500	0.41	5	Rent	NA	53.6	NA	NA	164.0	19.93	NA	18.5_to_24.9	82	109	62	124	68	106	60	112	64	NA	1.01	2.17	173	0.499	NA	NA	No	NA	Vgood	0	5	NA	NA	NA	NA	NA	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
69843	2011_12	male	40	40-49	NA	Other	Asian	College Grad	Married	more 99999	100000	5.00	8	Own	Working	90.0	NA	NA	185.2	26.20	NA	25.0_to_29.9	78	151	102	142	102	154	102	148	102	220.65	1.14	4.14	56	0.933	NA	NA	No	NA	Good	0	0	None	None	NA	NA	NA	7	No	Yes	7	0_to_1_hr	0_to_1_hr	NA	NA	Yes	2	48	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	29	1	0	No	Heterosexual	NA
61063	2009_10	female	71	70+	852	Other	NA	High School	Married	35000-44999	40000	1.01	7	Own	Working	46.6	NA	NA	142.7	22.88	NA	18.5_to_24.9	86	138	40	140	54	138	40	NA	NA	NA	1.68	4.76	38	0.585	NA	NA	No	NA	Good	5	0	None	None	4	3	23	6	No	Yes	5	NA	NA	NA	NA	No	NA	NA	NA	No	Non-Smoker	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
65384	2011_12	female	72	70+	NA	Other	Asian	High School	NeverMarried	35000-44999	40000	1.68	10	Own	NotWorking	48.9	NA	NA	148.7	22.10	NA	18.5_to_24.9	72	119	68	104	78	110	66	128	70	28.80	2.15	4.40	75	0.620	NA	NA	No	NA	Good	0	0	Several	None	NA	NA	NA	8	No	No	NA	3_hr	3_hr	NA	NA	Yes	1	6	NA	No	Non-Smoker	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
68007	2011_12	female	33	30-39	NA	Other	Asian	Some College	NeverMarried	25000-34999	30000	1.92	3	Rent	NotWorking	56.9	NA	NA	162.3	21.60	NA	18.5_to_24.9	78	101	70	102	64	98	68	104	72	15.95	1.99	4.24	79	0.675	NA	NA	No	NA	Good	0	0	Most	None	NA	NA	NA	5	No	No	NA	0_to_1_hr	1_hr	NA	NA	No	NA	NA	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	27	1	1	No	Heterosexual	No
71833	2011_12	female	30	30-39	NA	Other	Asian	College Grad	Married	65000-74999	70000	4.76	4	Rent	NotWorking	48.8	NA	NA	158.2	19.50	NA	18.5_to_24.9	84	83	54	88	56	86	60	80	48	17.49	1.27	5.12	23	0.793	68	0.791	No	NA	Good	7	0	None	None	NA	NA	NA	8	No	No	NA	1_hr	2_hr	NA	NA	Yes	1	12	NA	No	Non-Smoker	NA	No	NA	No	NA	No	Yes	20	8	1	No	Heterosexual	No
60755	2009_10	male	21	20-29	252	Other	NA	9 - 11th Grade	NeverMarried	more 99999	100000	4.27	8	Own	NotWorking	60.0	NA	NA	176.9	19.17	NA	18.5_to_24.9	62	104	65	114	62	102	64	106	66	NA	1.45	4.68	131	0.633	NA	NA	No	NA	Vgood	0	14	Several	Several	NA	NA	NA	7	No	Yes	1	NA	NA	NA	NA	No	NA	0	NA	No	Non-Smoker	NA	No	NA	No	NA	No	No	NA	0	0	No	Heterosexual	NA
67803	2011_12	male	6	0-9	NA	Other	Other	NA	NA	5000-9999	7500	0.30	4	Rent	NA	23.7	NA	NA	124.7	15.20	NormWeight	12.0_18.5	NA	NA	NA	NA	NA	NA	NA	NA	NA	1.75	1.37	4.84	197	3.230	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	5	3_hr	1_hr	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
51832	2009_10	male	38	30-39	458	Other	NA	College Grad	Married	20000-24999	22500	1.07	9	Own	Working	78.9	NA	NA	174.5	25.91	NA	25.0_to_29.9	84	118	70	112	72	120	72	116	68	NA	0.88	4.71	35	0.461	164	1.058	No	NA	Excellent	0	0	None	None	NA	NA	NA	8	No	No	NA	NA	NA	NA	NA	Yes	1	52	No	Yes	Smoker	20	No	NA	No	NA	No	Yes	28	1	1	No	Heterosexual	NA
54285	2009_10	male	4	0-9	48	Other	NA	NA	NA	75000-99999	87500	3.63	7	Own	NA	19.0	NA	NA	104.3	17.47	NA	12.0_18.5	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
53190	2009_10	female	47	40-49	573	Other	NA	8th Grade	Married	NA	NA	NA	7	Own	Working	60.1	NA	NA	157.8	24.14	NA	18.5_to_24.9	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1.78	5.25	72	0.818	NA	NA	No	NA	Good	0	30	None	NA	6	3	30	5	No	No	NA	NA	NA	NA	NA	No	3	3	NA	No	Non-Smoker	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
53064	2009_10	female	2	0-9	24	Other	NA	NA	NA	75000-99999	87500	4.64	4	Own	NA	12.1	NA	NA	86.1	16.32	NA	12.0_18.5	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	No	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	5	6	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
68873	2011_12	female	24	20-29	NA	Other	Asian	College Grad	NeverMarried	5000-9999	7500	0.66	5	Rent	Working	134.6	NA	NA	168.3	47.50	NA	30.0_plus	90	105	79	NA	NA	108	82	102	76	23.72	0.88	5.48	158	0.477	NA	NA	Yes	16	Poor	15	0	None	None	NA	NA	NA	6	Yes	Yes	NA	More_4_hr	More_4_hr	NA	NA	Yes	1	52	NA	No	Non-Smoker	NA	No	NA	No	NA	No	No	NA	0	0	No	Heterosexual	No

Systematic sampling is where you randomly choose a starting place then select every $k^{th}$ observation to measure.

For example:

You select every $5^{th}$ item on an assembly line.
You select every $10^{th}$ name on the list.
You select every $3^{rd}$ customer that comes into the store.

Make sure you randomly select the starting point. Also, if you want a sample of 100 units of observation and your population has 10,000 units of observation, then you would select every $100^{th}$ unit of observation, since 10,000/100 = 100.
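The arithmetic above can be sketched in R. This is only an illustration under stated assumptions: the data frame population and its column unit_id are made up for the example (they are not part of NHANES), and the code uses base R functions only. The call to set.seed() is there just so the example gives the same result each time it is run.

```r
# Hypothetical population: 10,000 units of observation, numbered in list order
population <- data.frame(unit_id = 1:10000)

n_population <- nrow(population) # size of the population (10,000)
n_sample <- 100                  # desired sample size
k <- n_population / n_sample     # sampling interval: 10,000 / 100 = 100

set.seed(1)                      # only so the example is reproducible
start <- sample(1:k, size = 1)   # random starting place among the first k units

# Take the starting unit and then every k-th unit after it
selected_rows <- seq(from = start, to = n_population, by = k)
systematic_sample <- population[selected_rows, , drop = FALSE]

nrow(systematic_sample)          # 100 units are selected
```

Because the starting place is chosen at random from the first k units, the sample always has 100 units no matter which start is drawn.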

Cluster sampling is where you break the population into groups called clusters. Randomly pick some clusters then poll all observations in those clusters.

For example:

A large city wants to poll all businesses in the city. They divide the city into sections (clusters), maybe a square block for each section, and use a random number generator to pick some of the clusters. Then they poll all businesses in each chosen cluster.
You want to measure whether a tree in the forest is infected with bark beetles. Instead of having to walk all over the forest, you divide the forest up into sectors (clusters), and then randomly pick the sectors (clusters) that you will travel to. Then record whether a tree is infected or not for every tree in that sector (cluster).

Many people confuse stratified sampling and cluster sampling. In stratified sampling you use all the groups and some of the members in each group. Cluster sampling is the other way around. It uses some of the groups and all the members in each group.
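The contrast may be easier to see side by side. The following Python sketch uses a made-up population of 6 city sections with 10 businesses each. The names `stratified_sample` and `cluster_sample` are hypothetical helpers written for this illustration.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 6 city sections (groups), each with 10 businesses.
groups = {
    f"section_{g}": [f"business_{g}_{i}" for i in range(10)]
    for g in range(6)
}

def stratified_sample(groups, per_group, rng):
    """Use every group, but only some members of each group."""
    return {name: rng.sample(members, per_group) for name, members in groups.items()}

def cluster_sample(groups, n_clusters, rng):
    """Use only some groups, but every member of each chosen group."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return {name: list(groups[name]) for name in chosen}

stratified = stratified_sample(groups, 3, rng)
clustered = cluster_sample(groups, 2, rng)
print(len(stratified), "groups,", sum(map(len, stratified.values())), "businesses")
print(len(clustered), "groups,", sum(map(len, clustered.values())), "businesses")
```

With these settings the stratified sample touches all 6 sections and takes 3 businesses from each, while the cluster sample visits only 2 sections and takes all 10 businesses in each.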

The four sampling techniques that were presented all have advantages and disadvantages. There is another sampling technique that is sometimes utilized because either the researcher doesn’t know better, or it is easier to do. This sampling technique is known as a convenience sample. It generally does not produce a representative sample and should be avoided.

A convenience sample is one where the researcher picks observations that are easy for the researcher to collect.

An example of a convenience sample is if you want to know the opinion of people about the criminal justice system, so you stand on a street corner near the county court house and question the first 10 people who walk by. The people who walk by the county court house are most likely involved in some fashion with the criminal justice system, and their opinions would not represent the opinions of the whole population.

On rare occasions, you do want to measure the entire population, in which case you conduct a census.

A census is when every observation is measured.

1.2.5 Example: Sampling type

Banner Health is a multi-state nonprofit chain of hospitals. Management wants to assess the incidence of complications after surgery. They wish to use a sample of surgery patients. Several sampling techniques are described below. Categorize each technique as a simple random sample, stratified sample, systematic sample, cluster sample, or convenience sample.

Obtain a list of patients who had surgery at all Banner Health facilities. Divide the patients according to type of surgery. Draw simple random samples from each group.
Obtain a list of patients who had surgery at all Banner Health facilities. Number these patients, and then use a random number table to obtain the sample.
Randomly select some Banner Health facilities from each of the seven states, and then include all the patients on the surgery lists of the states.
At the beginning of the year, instruct each Banner Health facility to record any complications from every $100^{th}$ surgery.
Instruct each Banner Health facility to record any complications from 20 surgeries this week and send in the results.

1.2.5.1 Solution

Obtain a list of patients who had surgery at all Banner Health facilities. Divide the patients according to type of surgery. Draw simple random samples from each group.

This is a stratified sample since the patients were separated into different strata and then random samples were taken from each stratum. The problem with this is that some types of surgeries may have more chances for complications than others. Of course, the stratified sample would show you this.

Obtain a list of patients who had surgery at all Banner Health facilities. Number these patients, and then use a random number table to obtain the sample.

This is a simple random sample since each patient has the same chance of being chosen. The problem with this one is that it will take a while to collect the data.

Randomly select some Banner Health facilities from each of the seven states, and then include all the patients on the surgery lists of the states.

This is a cluster sample since all patients are questioned in each of the selected hospitals. The problem with this is that you could have by chance selected hospitals that have no complications.

At the beginning of the year, instruct each Banner Health facility to record any complications from every $100^{th}$ surgery.

This is a systematic sample since they selected every $100^{th}$ surgery. The problem with this is that if every $90^{th}$ surgery has complications, you wouldn’t see this come up in the data.

Instruct each Banner Health facility to record any complications from 20 surgeries this week and send in the results.

This is a convenience sample since they left it up to the facility how to do it. The problem with convenience samples is that the person collecting the data will probably collect data from surgeries that had no complications.

1.2.6 Homework for Sampling Methods Section

Researchers want to collect cholesterol levels of U.S. patients who had a heart attack two days prior. The following are different sampling techniques that the researcher could use. Classify each as simple random sample, stratified sample, systematic sample, cluster sample, or convenience sample.
1. The researchers randomly select 5 hospitals in the U.S. then measure the cholesterol levels of all the heart attack patients in each of those hospitals.
2. The researchers list all of the heart attack patients and measure the cholesterol level of every $25^{th}$ person on the list.
3. The researchers go to one hospital on a given day and measure the cholesterol level of the heart attack patients at that time.
4. The researchers list all of the heart attack patients. They then measure the cholesterol levels of randomly selected patients.
5. The researchers divide the heart attack patients based on race, and then measure the cholesterol levels of randomly selected patients in each race grouping.

The quality control officer at a manufacturing plant needs to determine what percentage of items in a batch are defective. The following are different sampling techniques that could be used by the officer. Classify each as simple random sample, stratified sample, systematic sample, cluster sample, or convenience sample.

The officer lists all of the batches in a given month. The number of defective items is counted in randomly selected batches.
The officer takes the first 10 batches and counts the number of defective items.
The officer groups the batches made in a month into which shift they are made. The number of defective items is counted in randomly selected batches in each shift.
The officer chooses every $15^{th}$ batch off the line and counts the number of defective items in each chosen batch.
The officer divides the batches made in a month into which day they were made. Then certain days are picked and every batch made that day is counted to determine the number of defective items.

You wish to determine the GPA of students at your school. Describe what process you would go through to collect a sample if you use a simple random sample.
You wish to determine the GPA of students at your school. Describe what process you would go through to collect a sample if you use a stratified sample.
You wish to determine the GPA of students at your school. Describe what process you would go through to collect a sample if you use a systematic sample.
You wish to determine the GPA of students at your school. Describe what process you would go through to collect a sample if you use a cluster sample.
You wish to determine the GPA of students at your school. Describe what process you would go through to collect a sample if you use a convenience sample.

1.3 Experimental Design

This section is an introduction to experimental design: how to design an experiment or a survey so that it is statistically sound. Experimental design is a very involved process, so this is just a small introduction.

1.3.1 Guidelines for planning a statistical study

Identify the observations that you are interested in. Realize that you can only make conclusions for these observations. As an example, if you use a fertilizer on a certain genus of plant, you can’t say how the fertilizer will work on any other types of plants. However, if you diversify too much, then you may not be able to tell if there really is an improvement since you have too many factors to consider.
Specify the variable. You want to make sure this is something that you can measure, and make sure that you control for all other factors too. As an example, if you are trying to determine if a fertilizer works by measuring the height of the plants on a particular day, you need to make sure you can control how much fertilizer you put on the plants (which would be your treatment), and make sure that all the plants receive the same amount of sunlight and water and are kept at the same temperature.
Specify the population. This is important in order for you to know what conclusions you can make and what observations you are making the conclusions about.
Specify the method for taking measurements or making observations.
Determine if you are taking a census or sample. If taking a sample, decide on the sampling method.
Collect the data.
Use appropriate descriptive statistics methods and make decisions using appropriate inferential statistics methods.
Note any concerns you might have about your data collection methods and list any recommendations for future studies.

There are two types of studies:

An observational study is when the investigator collects data merely by watching or asking questions. Nothing is changed or controlled.

An experiment is when the investigator changes a variable or imposes a treatment to determine its effect.

1.3.2 Example: Observational Study or Experiment

State if the following is an observational study or an experiment.

Poll students to see if they favor increasing tuition.
Give some students a tutor to see if grades improve.

1.3.2.1 Solution

Poll students to see if they favor increasing tuition.

This is an observational study. You are only asking a question.

Give some students a tutor to see if grades improve.

This is an experiment. The tutor is the treatment.

1.3.3 Survey

Many observational studies involve surveys. A survey uses questions to collect the data and needs to be written so that there is no bias.

1.3.4 Experiment Options

In an experiment, there are different options.

Randomized two-treatment experiment: in this experiment, there are two treatments, and observations are randomly placed into the two groups. Either both groups get a treatment, or one group gets a treatment and the other gets either nothing or a placebo. The group getting either an old treatment, no treatment, or a placebo is called the control group. The group getting the treatment is called the treatment group. The idea of the placebo is that a person thinks they are receiving a treatment, but in reality they are receiving a sugar pill or fake treatment. Doing this helps to account for the placebo effect, which is where a person’s mind makes their body respond as if to a treatment because they think they are taking the treatment when they are not. Note that not every experiment needs a placebo, such as when using animals or plants. Also, you can’t always use a placebo or no treatment. As an example, if you are testing a new blood pressure medication, you can’t give a person with high blood pressure a placebo or no treatment, for ethical reasons.
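The random placement into two groups can be sketched as follows. This is a minimal illustration, assuming 20 made-up subject labels. The function name `randomize_two_groups` is hypothetical.

```python
import random

def randomize_two_groups(subjects, rng=None):
    """Shuffle the subjects, then split them into a treatment and a control group."""
    rng = rng or random.Random()
    shuffled = list(subjects)        # copy so the original order is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subjects labeled subject_1 through subject_20.
subjects = [f"subject_{i}" for i in range(1, 21)]
treatment, control = randomize_two_groups(subjects, random.Random(42))
print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides the groups, neither the researcher nor the subjects influence who receives the treatment.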

Randomized Block Design: a block is a group of subjects that are similar, but the blocks differ from each other. Treatments are then randomly assigned to subjects inside each block. An example would be separating students into full-time versus part-time, and then randomly picking a certain number of full-time students to get the treatment and a certain number of part-time students to get the treatment. This way some of each type of student get the treatment and some do not.
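A short Python sketch of the full-time versus part-time example follows. The student labels, the block sizes, and the function name `randomized_block_assignment` are assumptions made for illustration. Splitting each block in half is just one possible choice of how many students get the treatment.

```python
import random

def randomized_block_assignment(blocks, rng=None):
    """Within each block, randomly give half the subjects the treatment."""
    rng = rng or random.Random()
    assignment = {}
    for block_name, members in blocks.items():
        shuffled = list(members)
        rng.shuffle(shuffled)              # randomize separately inside each block
        half = len(shuffled) // 2
        for subject in shuffled[:half]:
            assignment[subject] = (block_name, "treatment")
        for subject in shuffled[half:]:
            assignment[subject] = (block_name, "control")
    return assignment

# Hypothetical blocks: 8 full-time students and 6 part-time students.
blocks = {
    "full-time": [f"FT_{i}" for i in range(8)],
    "part-time": [f"PT_{i}" for i in range(6)],
}
assignment = randomized_block_assignment(blocks, random.Random(3))
for subject, (block, group) in sorted(assignment.items()):
    print(subject, block, group)
```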

Rigorously Controlled Design: carefully assign subjects to different treatment groups, so that those given each treatment are similar in ways that are important to the experiment. An example would be if you have a full-time student who is male, takes only night classes, has a full-time job, and has children in one treatment group, then you need to have the same type of student getting the other treatment. This type of design is hard to implement since you don’t know how many distinguishing characteristics you would need to match on, and it should be avoided.

Matched Pairs Design: the treatments are given to two groups that can be matched up with each other in some ways. One example would be to measure the effectiveness of a muscle relaxer cream on the right arm and the left arm of observations, and then for each observation you can match up their right arm measurement with their left arm. Another example of this would be before and after experiments, such as weight before and weight after a diet.

No matter which experiment type you conduct, you should also consider the following:

Replication: repetition of an experiment on more than one observation so you can make sure that the sample is large enough to distinguish true effects from random effects. It is also the ability for someone else to duplicate the results of the experiment.

Blind study is where the subject used in the study does not know which treatment they are getting or if they are getting the treatment or a placebo.

Double-blind study is where neither the subject used in the study nor the researcher knows who is getting which treatment or who is getting the treatment and who is getting the placebo. This is important so that there can be no bias created by either the subject or the researcher.

One last consideration is the time period that you are collecting the data over. There are three types of time periods that you can consider.

Cross-sectional study: data observed, measured, or collected at one point in time.

Retrospective (or case-control) study: data collected from the past using records, interviews, and other similar artifacts.

Prospective (or longitudinal or cohort) study: data collected in the future from groups sharing common factors.

1.3.5 Homework for Experimental Design Section

You want to determine if cinnamon reduces a person’s insulin sensitivity. You give patients who are insulin sensitive a certain amount of cinnamon and then measure their glucose levels. Is this an observation or an experiment? Why?
You want to determine if eating more fruits reduces a person’s chance of developing cancer. You watch people over the years and ask them to tell you how many servings of fruit they eat each day. You then record who develops cancer. Is this an observation or an experiment? Why?
A researcher wants to evaluate whether countries with lower fertility rates have a higher life expectancy. They collect the fertility rates and the life expectancies of countries around the world. Is this an observation or an experiment? Why?
To evaluate whether a new fertilizer improves plant growth more than the old fertilizer, the fertilizer developer gives some plants the new fertilizer and others the old fertilizer. Is this an observation or an experiment? Why?
A researcher designs an experiment to determine if a new drug lowers the blood pressure of patients with high blood pressure. The patients are randomly selected to be in the study and they randomly pick which group to be in. Is this a randomized experiment? Why or why not?
Doctors trying to see if a new stent works longer for kidney patients ask patients if they are willing to have one of two different stents put in. During the procedure the doctor decides which stent to put in based on which one is on hand at the time. Is this a randomized experiment? Why or why not?
A researcher wants to determine if diet and exercise together help people lose more weight than exercise alone. The researcher solicits volunteers to be part of the study, randomly picks which volunteers are in the study, and then lets each volunteer decide if they want to be in the diet and exercise group or the exercise only group. Is this a randomized experiment? Why or why not?
To determine if lack of exercise reduces flexibility in the knee joint, physical therapists ask for volunteers to join their trials. They then randomly select the volunteers to be in the group that exercises and to be in the group that doesn’t exercise. Is this a randomized experiment? Why or why not?
You collect the weights of tagged fish in a tank. You then put an extra protein fish food in water for the fish and then measure their weight a month later. Are the two samples matched pairs or not? Why or why not?
A mathematics instructor wants to see if a computer homework system improves the scores of the students in the class. The instructor teaches two different sections of the same course. One section utilizes the computer homework system and the other section completes homework with paper and pencil. Are the two samples matched pairs or not? Why or why not?
A business manager wants to see if a new procedure improves the processing time for a task. The manager measures the processing time of the employees then trains the employees using the new procedure. Then each employee performs the task again and the processing time is measured again. Are the two samples matched pairs or not? Why or why not?
The prices of generic items are compared to the prices of the equivalent named brand items. Are the two samples matched pairs or not? Why or why not?
A doctor gives some of the patients a new drug for treating acne and the rest of the patients receive the old drug. Neither the patient nor the doctor knows who is getting which drug. Is this a blind experiment, double blind experiment, or neither? Why?
One group is told to exercise and one group is told to not exercise. Is this a blind experiment, double blind experiment, or neither? Why?
The researchers at a hospital want to see if a new surgery procedure has a better recovery time than the old procedure. The patients are not told which procedure was used on them, but the surgeons obviously did know. Is this a blind experiment, double blind experiment, or neither? Why?
To determine if a new medication reduces headache pain, some patients are given the new medication and others are given a placebo. Neither the researchers nor the patients know who is taking the real medication and who is taking the placebo. Is this a blind experiment, double blind experiment, or neither? Why?
A new study is underway to track the eating and exercise patterns of people at different time periods in the future, and see who is afflicted with cancer later in life. Is this a cross-sectional study, a retrospective study, or a prospective study? Why?
To determine if a new medication reduces headache pain, some patients are given the new medication and others are given a placebo. The pain levels of a patient are then recorded. Is this a cross-sectional study, a retrospective study, or a prospective study? Why?
To see if there is a link between smoking and bladder cancer, patients with bladder cancer are asked if they currently smoke or if they smoked in the past. Is this a cross-sectional study, a retrospective study, or a prospective study? Why?
The Nurses Health Survey was a survey where nurses were asked to record their eating habits over a period of time, and their general health was recorded. Is this a cross-sectional study, a retrospective study, or a prospective study? Why?
Consider a question that you would like to answer. Describe how you would design your own experiment. Make sure you state the question you would like to answer, then determine if an experiment or an observation is to be done, decide if the question needs one or two samples, if two samples are the samples matched, if this is a randomized experiment, if there is any blinding, and if this is a cross-sectional, retrospective, or prospective study.

1.4 How Not to Do Statistics

Many studies are conducted and conclusions are made. However, there are occasions where the study is not conducted in the correct manner or the conclusion is not correctly made based on the data. There are many things that you should question when you read a study, and there are many ways for a study to have bias in it. Bias is where a study may have a certain slant or preference for a certain result. The following is a list of some of the questions or issues you should consider to help decide if there is bias in a study.

One of the first questions you should ask is who funded the study. If the entity that sponsored the study stands to gain either profits or notoriety from the results, then you should question the results. It doesn’t mean that the results are wrong, but you should scrutinize them on your own to make sure they are sound. As an example, if a study says that genetically modified foods are safe, and the study was funded by a company that sells genetically modified food, then one may question the validity of the study. Since the company funds the study and their profits rely on people buying their food, there may be bias.

An experiment could have lurking or confounding variables when you cannot rule out the possibility that the observed effect is due to some other variable rather than the factor being studied. An example of this is when you give fertilizer to some plants and no fertilizer to others, but the no fertilizer plants also are placed in a location that doesn’t receive direct sunlight. You won’t know if the plants that received the fertilizer grew taller because of the fertilizer or the sunlight. Make sure you design experiments to eliminate the effects of confounding variables by controlling all the factors that you can.

Overgeneralization is where you do a study on one group and then try to say that it will happen on all groups. An example is doing cancer treatments on rats. Just because the treatment works on rats does not mean it will work on humans. Another example is that until recently most FDA medication testing had been done on white males of a particular age. There is no way to know how the medication affects other genders, ethnic groups, age groups, and races. The new FDA guidelines stress using subjects from different groups.

Cause and effect is where people decide that one variable causes the other just because the variables are related. Unless the study was done as an experiment where a variable was controlled, you cannot say that one variable caused the other. There is the possibility that another variable caused both to change. As an example, there is a relationship between number of drownings at the beach and ice cream sales. This does not mean that ice cream sales increasing causes people to drown. Most likely the cause for both increasing is the heat.
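A small simulation can make the role of the third variable concrete. In the Python sketch below, all numbers are made up: temperature drives both ice cream sales and drownings, and neither of those two quantities appears in the formula for the other. The helper `pearson_r` is a hand-written correlation function included so the sketch is self-contained.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up daily temperatures (the lurking variable).
temps = [rng.uniform(15, 35) for _ in range(200)]
# Both quantities depend on temperature plus random noise, not on each other.
ice_cream_sales = [20 * t + rng.gauss(0, 40) for t in temps]
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temps]

r = pearson_r(ice_cream_sales, drownings)
print(f"correlation between ice cream sales and drownings: {r:.2f}")
```

The two simulated series come out positively correlated even though, by construction, neither one has any effect on the other.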

Sampling error: This is the difference between the sample results and the true population results. This is unavoidable, and results in the fact that samples are different from each other. As an example, if you take a sample of 5 people’s height in your class, you will get 5 numbers. If you take another sample of 5 people’s heights in your class, you will likely get 5 different numbers.
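The height example can be simulated. The sketch below assumes a hypothetical class of 30 students with made-up heights, draws three samples of 5, and reports how far each sample mean falls from the mean of the whole class.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed so the sketch is repeatable

# Made-up heights (cm) for a hypothetical class of 30 students.
class_heights = [round(rng.gauss(170, 8), 1) for _ in range(30)]
population_mean = statistics.mean(class_heights)

samples = []
sampling_errors = []
for trial in range(1, 4):
    sample = rng.sample(class_heights, 5)   # 5 students, chosen without replacement
    error = statistics.mean(sample) - population_mean
    samples.append(sample)
    sampling_errors.append(error)
    print(f"sample {trial}: mean = {statistics.mean(sample):.1f} cm, "
          f"sampling error = {error:+.1f} cm")
```

Each sample gives a somewhat different mean. That variation comes only from which students happened to be chosen, not from any mistake in measurement.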

Non-sampling error: This is where the sample is collected poorly either through a biased sample or through error in measurements. Care should be taken to avoid this error.

Lastly, care should be taken in considering the difference between statistical significance and practical significance. This is a major issue in statistics. Something could be statistically significant, which means that a statistical test shows there is evidence for what you are trying to prove. However, in practice it may not mean much, or there may be other issues to consider. As an example, suppose you find that a new drug for high blood pressure does reduce the blood pressure of patients. When you look at the improvement it actually doesn’t amount to a large difference. Even though statistically there is a change, it may not be worth marketing the product because it really isn’t that big of a change. Another consideration is that you find the blood pressure medication does improve a person’s blood pressure, but it has serious side effects or it costs a great deal for a prescription. In this case, it wouldn’t be practical to use it. In both cases, the study is shown to be statistically significant, but practically you don’t want to use the medication. The main thing to remember in a statistical study is that the statistics is only part of the process. You also want to make sure that there is practical significance. One more comment on statistical significance: the American Statistical Association (ASA) recently came out with a statement, “Based on our review of the articles in this special issue and the broader literature, we conclude that it is time to stop using the term ‘statistically significant’ entirely.” (Advanced Solutions International, Inc, 2019) Though the ASA suggests not using this term anymore, there are many studies that have been done in the past that use this term, so it is presented here. However, it is not a term that should be used, and it will be downplayed in the rest of this book.

Surveys have their own areas of bias that can occur. A few of the issues with surveys are in the wording of the questions, the ordering of the questions, the manner the survey is conducted, and the response rate of the survey.

The wording of the questions can cause hidden bias, which is where the questions are asked in a way that makes a person respond a certain way. An example is that a poll was done where people were asked if they believe that there should be an amendment to the constitution protecting a woman’s right to choose. About 60% of all people questioned said yes. Another poll was done where people were asked if they believe that there should be an amendment to the constitution protecting the life of an unborn child. About 60% of all people questioned said yes. These two questions deal with the same issue, yet the results appear to contradict each other because how the question was asked affected the outcome.

The ordering of the questions can also cause hidden bias. An example of this is if you were asked if there should be a fine for texting while driving, but preceding that question is a question asking if you text while driving. By asking a person if they actually partake in the activity, that person now personalizes the question, and that might affect how they answer the next question about creating the fine.

Non-response is where you send out a survey but not everyone returns the survey. You can calculate the response rate by dividing the number of returns by the number of surveys sent. Most response rates are around 30-50%. A response rate less than 30% is very poor and the results of the survey are not valid. To reduce non-response, it is better to conduct the surveys in person, though these are very expensive. Phones are the next best way to conduct surveys, emails can be effective, and physical mailings are the least desirable way to conduct surveys.
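The response rate calculation is simple arithmetic. The sketch below uses made-up counts (500 surveys sent, 180 returned), and the function name `response_rate` is hypothetical.

```python
def response_rate(returned, sent):
    """Number of surveys returned divided by the number of surveys sent."""
    if sent <= 0:
        raise ValueError("the number of surveys sent must be positive")
    return returned / sent

# Made-up counts for illustration: 500 surveys sent, 180 returned.
rate = response_rate(180, 500)
print(f"response rate: {rate:.0%}")  # prints "response rate: 36%"
```

A rate of 36% would fall inside the typical 30-50% range described above.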

Voluntary response is where people are asked to respond via phone, email or online. The problem with these is that only people who really care about the topic are likely to call or email. These surveys are not scientific and the results from these surveys are not valid. Note: all studies involve volunteers. The difference between a voluntary response survey and a scientific study is that in a scientific study the researchers ask the subjects to be involved, while in a voluntary response survey the subjects become involved of their own choosing.

1.4.1 Example: Bias in a Study

Suppose a mathematics department at a community college would like to assess whether computer-based homework improves students’ test scores. They use computer-based homework in one classroom with one teacher and use traditional paper and pencil homework in a different classroom with a different teacher. The students using the computer-based homework had higher test scores. What is wrong with this experiment?

1.4.1.1 Solution

Since there were different teachers, you do not know if the better test scores are because of the teacher or the computer-based homework. A better design would be to have the same teacher teach both classes. The control group would utilize traditional paper and pencil homework and the treatment group would utilize the computer-based homework. Both classes would have the same teacher, and the students would be split between the two classes randomly. The only difference between the two groups should be the homework method. Of course, there is still variability between the students, but utilizing the same teacher will reduce any other confounding variables.

1.4.2 Example: Cause and Effect

Determine if the one variable did cause the change in the other variable.

Cinnamon was given to a group of people who have diabetes, and then their blood glucose levels were measured a time period later. All other factors for each person were kept the same. Their glucose levels went down. Did the cinnamon cause the reduction?
There is a link between spray on tanning products and lung cancer. Does that mean that spray on tanning products cause lung cancer?

1.4.2.1 Solution

Cinnamon was given to a group of people who have diabetes, and then their blood glucose levels were measured a time period later. All other factors for each person were kept the same. Their glucose levels went down. Did the cinnamon cause the reduction?

Since this was a study where the use of cinnamon was controlled, and all other factors were kept constant from person to person, any changes in glucose levels can be attributed to the use of cinnamon.

There is a link between spray on tanning products and lung cancer. Does that mean that spray on tanning products cause lung cancer?

Since there is only a link, and not a study controlling the use of the tanning spray, you cannot say that increased use causes lung cancer. You can say that there is a link, and that there could be a cause, but you cannot say for sure that the spray causes the cancer.

1.4.3 Example: Generalizations

A researcher conducts a study on the use of ibuprofen on humans and finds that it is safe. Does that mean that all species can use ibuprofen?
Aspirin has been used for years to bring down fevers in humans. Originally it was tested on white males between the ages of 25 and 40 and found to be safe. Is it safe to give to everyone?

1.4.3.1 Solution

A researcher conducts a study on the use of ibuprofen on humans and finds that it is safe. Does that mean that all species can use ibuprofen?

No. Just because a drug is safe to use on one species doesn’t mean it is safe to use for all species. In fact, ibuprofen is toxic to cats.

Aspirin has been used for years to bring down fevers in humans. Originally it was tested on white males between the ages of 25 and 40 and found to be safe. Is it safe to give to everyone?

No. Just because one age group can use it doesn’t mean it is safe to use for all age groups. In fact, there has been a link between giving a child under the age of 19 aspirin when they have a fever and Reye’s syndrome.

1.4.4 Homework for How Not to Do Statistics Section

Suppose there is a study where a researcher conducts an experiment to show that deep breathing exercises help to lower blood pressure. The researcher takes two groups of people and has one group perform deep breathing exercises and a series of aerobic exercises every day, while the other group is asked to refrain from any exercises. The researcher found that the group performing the deep breathing exercises and the aerobic exercises had lower blood pressure. Discuss any issue with this study.
Suppose a car dealership offers a low interest rate and a longer payoff period to customers or a high interest rate and a shorter payoff period to customers, and most customers choose the low interest rate and longer payoff period, does that mean that most customers want a lower interest rate? Explain.
Over the years it has been said that coffee is bad for you. When looking at the studies that have shown that coffee is linked to poor health, you will see that people who tend to drink coffee don’t sleep much, tend to smoke, don’t eat healthy, and tend to not exercise. Can you say that the coffee is the reason for the poor health or is there a lurking variable that is the actual cause? Explain.
When researchers were trying to figure out what caused polio, they saw a connection between ice cream sales and polio. As ice cream sales increased, so did the incidence of polio. Does that mean that eating ice cream causes polio? Explain your answer.
There is a positive correlation between discussions of gun control, which usually occur after a mass shooting, and the sale of guns. Does that mean that the discussion of gun control increases the likelihood that people will buy more guns? Explain.
There is a study that shows that people who are obese have a vitamin D deficiency. Does that mean that obesity causes a deficiency in vitamin D? Explain.
A study was conducted that shows that perfluorooctanoic acid (PFOA), a chemical that has been used in making Teflon, is associated with an increased risk of tumors in lab mice. Does that mean that PFOA carries an increased risk of tumors in humans? Explain.
Suppose a telephone poll is conducted by contacting U.S. citizens via landlines about their view of gay marriage. Suppose over 50% of those called do not support gay marriage. Does that mean that you can say over 50% of all people in the U.S. do not support gay marriage? Explain.
Suppose that it can be shown to be statistically significant that a smaller percentage of people are satisfied with your business. The percentage before was 87% and is now 85%. Do you change how you conduct business? Explain.
You are testing a new drug for weight loss. You find that the drug does in fact produce a statistically significant weight loss. Do you market the new drug? Why or why not?
There was an online poll conducted about whether the mayor of Auckland, New Zealand, should resign due to an affair. The majority of people participating said he should. Should the mayor resign due to the results of this poll? Explain.
An online poll showed that the majority of Americans believe that the government covered up events of 9/11. Does that really mean that most Americans believe this? Explain.
A survey was conducted at a college asking all employees if they were satisfied with the level of security provided by the security department. Discuss how the results of this question could be biased.
An employee survey says, “Employees at this institution are very satisfied with working here. Please rate your satisfaction with the institution.” Discuss how this question could create bias.
A survey has a question that says, “Most people are afraid that they will lose their house due to economic collapse. Choose what you think is the biggest issue facing the nation today. a) Economic collapse, b) Foreign policy issues, c) Environmental concerns.” Discuss how this question could create bias.
A survey says, “Please rate the career of Roberto Clemente, one of the best right field baseball players in the world.” Discuss how this question could create bias.
